[ENH] make single problem loaders for equal length problems return numpy arrays (#109)

Changes the `return_type` from `None` to `numpy3d` and removes the very long comments describing possible data structures in unnecessary detail. Part of #42
TonyBagnall authored Feb 28, 2023
1 parent 015339f commit f7e226a
Showing 37 changed files with 388 additions and 550 deletions.
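
For illustration, a minimal sketch of the changed loader behaviour (not part of the diff; it assumes the post-change loaders and uses `load_arrow_head` as a representative equal-length problem):

    from sktime.datasets import load_arrow_head

    # equal-length problem loaders now default to return_type="numpy3d"
    X, y = load_arrow_head(split="train")
    print(type(X))  # numpy.ndarray of shape (n_cases, n_channels, n_timepoints)

    # the previous nested-DataFrame format remains available on request
    X_nested, _ = load_arrow_head(split="train", return_type="nested_univ")
    print(type(X_nested))  # pandas.DataFrame with pd.Series cells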
4 changes: 2 additions & 2 deletions examples/02_classification.ipynb
@@ -545,7 +545,7 @@
 "`sktime` offers two other ways of building estimators for multivariate time series problems:\n",
 "\n",
 "1. Concatenation of time series columns into a single long time series column via `ColumnConcatenator` and apply a classifier to the concatenated data,\n",
-"2. Dimension ensembling via `ColumnEnsembleClassifier` in which one classifier is fitted for each time series column/dimension of the time series and their predictions are combined through a voting scheme. \n",
+"2. Dimension ensembling via `ColumnEnsembleClassifier` in which one classifier is fitted for each time series column/dimension of the time series and their predictions are combined through a voting scheme.\n",
 "\n",
 "We can concatenate multivariate time series/panel data into long univariate time series/panel using a transform and then apply a classifier to the univariate data:"
 ]
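
For illustration, a minimal sketch of the two strategies named in the cell above (not part of the diff; the estimator choice and `n_estimators` values are illustrative assumptions, not the notebook's exact settings):

    from sklearn.pipeline import Pipeline
    from sktime.classification.compose import ColumnEnsembleClassifier
    from sktime.classification.interval_based import TimeSeriesForestClassifier
    from sktime.datasets import load_basic_motions
    from sktime.transformations.panel.compose import ColumnConcatenator

    X_train, y_train = load_basic_motions(split="train", return_type="nested_univ")

    # 1. concatenate all columns into one long univariate series, then classify
    clf1 = Pipeline([
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=10)),
    ])
    clf1.fit(X_train, y_train)

    # 2. one classifier per column/dimension; predictions combined by voting
    clf2 = ColumnEnsembleClassifier(estimators=[
        ("tsf_dim_0", TimeSeriesForestClassifier(n_estimators=10), [0]),
        ("tsf_dim_1", TimeSeriesForestClassifier(n_estimators=10), [1]),
    ])
    clf2.fit(X_train, y_train)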
@@ -690,7 +690,7 @@
 "\n",
 "#### HIVE-COTE 2.0 (HC2)\n",
 "The HIerarchical VotE Collective of Transformation-based Ensembles is a meta ensemble that combines classifiers built on different representations. Version 2 combines DrCIF, TDE, an ensemble of RocketClassifiers called the Arsenal and the ShapeletTransformClassifier. It is one of the most accurate classifiers on the UCR and UEA time series archives.\n",
-" \n",
+"\n",
 "[3] Middlehurst, Matthew, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, and Anthony Bagnall. \"HIVE-COTE 2.0: a new meta ensemble for time series classification.\" Machine Learning (2021)\n",
 "[ML 2021](https://link.springer.com/article/10.1007/s10994-021-06057-9)\n",
 "\n",
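
For illustration, a hedged usage sketch of the HC2 classifier described above (not part of the diff; the one-minute time limit is an assumption to keep the run short):

    from sktime.classification.hybrid import HIVECOTEV2
    from sktime.datasets import load_unit_test

    X_train, y_train = load_unit_test(split="train")
    X_test, _ = load_unit_test(split="test")

    hc2 = HIVECOTEV2(time_limit_in_minutes=1)  # HC2 is expensive; cap the contract time
    hc2.fit(X_train, y_train)
    print(hc2.predict(X_test))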
9 changes: 6 additions & 3 deletions examples/02a_classification_multivariate_cnn.ipynb
@@ -39,8 +39,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)\n",
-"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
+"X_train, y_train = load_basic_motions(\n",
+"    split=\"train\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
+"X_test, y_test = load_basic_motions(\n",
+"    split=\"test\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
 "print(X_train.shape)\n",
 "print(X_test.shape)\n",
 "print(type(X_train.iloc[1, 1]))\n",
@@ -133,5 +137,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-
 }
8 changes: 6 additions & 2 deletions examples/02b_classification_multivariate_lstmfcn.ipynb
@@ -39,8 +39,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)\n",
-"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
+"X_train, y_train = load_basic_motions(\n",
+"    split=\"train\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
+"X_test, y_test = load_basic_motions(\n",
+"    split=\"test\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
 "print(X_train.shape)\n",
 "print(X_test.shape)\n",
 "print(type(X_train.iloc[1, 1]))\n",
12 changes: 6 additions & 6 deletions examples/AA_datatypes_and_datasets.ipynb
@@ -189,7 +189,7 @@
 "* structure convention: `obj` must be 2D, i.e., `obj.shape` must have length 2. This is also true for univariate time series.\n",
 "* variables: variables correspond to columns of `obj`.\n",
 "* variable names: the `\"np.ndarray\"` mtype cannot represent variable names.\n",
-"* time points: the rows of `obj` correspond to different, distinct time points. \n",
+"* time points: the rows of `obj` correspond to different, distinct time points.\n",
 "* time index: The time index is implicit and by-convention. The `i`-th row (for an integer `i`) is interpreted as an observation at the time point `i`.\n",
 "* capabilities: cannot represent multivariate series; cannot represent unequally spaced series"
 ]
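
For illustration, a minimal sketch of an object satisfying this convention (not part of the diff; assumes sktime's `check_is_mtype` checker from `sktime.datatypes`):

    import numpy as np
    from sktime.datatypes import check_is_mtype

    # univariate series with 4 time points: rows are time points, the single column is the variable
    obj = np.array([[1.0], [2.0], [3.0], [4.0]])
    print(check_is_mtype(obj, mtype="np.ndarray", scitype="Series"))  # True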
@@ -262,8 +262,8 @@
 "\n",
 "* structure convention: `obj.index` must be a pair multi-index of type `(Index, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonic. `obj.index` must have two levels (can be named or not).\n",
 "* instance index: the first element of pairs in `obj.index` (0-th level value) is interpreted as an instance index, we call it \"instance index\" below.\n",
-"* instances: rows with the same \"instance index\" index value correspond to the same instance; rows with different \"instance index\" values correspond to different instances. \n",
-"* time index: the second element of pairs in `obj.index` (1-st level value) is interpreted as a time index, we call it \"time index\" below. \n",
+"* instances: rows with the same \"instance index\" index value correspond to the same instance; rows with different \"instance index\" values correspond to different instances.\n",
+"* time index: the second element of pairs in `obj.index` (1-st level value) is interpreted as a time index, we call it \"time index\" below.\n",
 "* time points: rows of `obj` with the same \"time index\" value correspond to the same time point; rows of `obj` with different \"time index\" values correspond to different time points.\n",
 "* variables: columns of `obj` correspond to different variables\n",
 "* variable names: column names `obj.columns`\n",
@@ -297,7 +297,7 @@
 "\n",
 "* structure convention: `obj` must be 3D, i.e., `obj.shape` must have length 3.\n",
 "* instances: instances correspond to axis 0 elements of `obj`.\n",
-"* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as indicative of observing instance `i`. \n",
+"* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as indicative of observing instance `i`.\n",
 "* variables: variables correspond to axis 1 elements of `obj`.\n",
 "* variable names: the `\"numpy3D\"` mtype cannot represent variable names.\n",
 "* time points: time points correspond to axis 2 elements of `obj`.\n",
@@ -376,8 +376,8 @@
 "\n",
 "* structure convention: `obj.index` must be a 3 or more level multi-index of type `(Index, ..., Index, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonic. We call the last index the \"time-like\" index.\n",
 "* hierarchy level: rows with the same non-time-like index values correspond to the same hierarchy unit; rows with different non-time-like index combinations correspond to different hierarchy units.\n",
-"* hierarchy: the non-time-like indices in `obj.index` are interpreted as a hierarchy identifying index. \n",
-"* time index: the last element of tuples in `obj.index` is interpreted as a time index. \n",
+"* hierarchy: the non-time-like indices in `obj.index` are interpreted as a hierarchy identifying index.\n",
+"* time index: the last element of tuples in `obj.index` is interpreted as a time index.\n",
 "* time points: rows of `obj` with the same `\"timepoints\"` index correspond to the same time point; rows of `obj` with different `\"timepoints\"` index correspond to different time points.\n",
 "* variables: columns of `obj` correspond to different variables\n",
 "* variable names: column names `obj.columns`\n",
10 changes: 2 additions & 8 deletions examples/feature_extraction_with_tsfresh.ipynb
@@ -80,7 +80,7 @@
 }
 ],
 "source": [
-"X, y = load_arrow_head(return_X_y=True)\n",
+"X, y = load_arrow_head(return_X_y=True, return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
 ]
@@ -97,9 +97,6 @@
 },
 "jupyter": {
 "outputs_hidden": false
-},
-"pycharm": {
-"name": "#%%\n"
 }
 },
 "outputs": [
@@ -606,7 +603,7 @@
 }
 ],
 "source": [
-"X, y = load_basic_motions(return_X_y=True)\n",
+"X, y = load_basic_motions(return_X_y=True, return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
 ]
@@ -623,9 +620,6 @@
 },
 "jupyter": {
 "outputs_hidden": false
-},
-"pycharm": {
-"name": "#%%\n"
 }
 },
 "outputs": [
12 changes: 6 additions & 6 deletions examples/interpolation.ipynb
@@ -11,8 +11,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Suppose we have a set of time series with different lengths, i.e. different numbers \n",
-"of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first convert our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so. "
+"Suppose we have a set of time series with different lengths, i.e. different numbers\n",
+"of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first convert our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so."
 ]
 },
 {
@@ -73,7 +73,7 @@
 }
 ],
 "source": [
-"X, y = load_basic_motions()\n",
+"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "\n",
 "steps = [\n",
@@ -133,7 +133,7 @@
 "    ) # here is a problem\n",
 "\n",
 "\n",
-"X, y = load_basic_motions()\n",
+"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "\n",
 "for df in [X_train, X_test]:\n",
@@ -156,7 +156,7 @@
 "metadata": {},
 "source": [
 "# Now the interpolator enters\n",
-"Now we use our interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points. \n",
+"Now we use our interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points.\n",
 "\n",
 "After interpolating the data, the classifier works again."
 ]
@@ -187,7 +187,7 @@
 "source": [
 "from sktime.transformations.panel.interpolate import TSInterpolator\n",
 "\n",
-"X, y = load_basic_motions()\n",
+"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "\n",
 "for df in [X_train, X_test]:\n",
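
For illustration, a minimal sketch of the interpolator described above (not part of the diff; passing the target length as the sole constructor argument is an assumption based on the sktime API of this period):

    from sktime.datasets import load_basic_motions
    from sktime.transformations.panel.interpolate import TSInterpolator

    X, _ = load_basic_motions(return_type="nested_univ")
    # resize every series to 50 equidistant points, via scipy linear interpolation
    X_resized = TSInterpolator(50).fit_transform(X)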
31 changes: 12 additions & 19 deletions examples/minirocket.ipynb
Expand Up @@ -87,7 +87,7 @@
},
"outputs": [],
"source": [
"X_train, y_train = load_arrow_head(split=\"train\", return_X_y=True)\n",
"X_train, y_train = load_arrow_head(split=\"train\", return_type=\"nested_univ\")\n",
"# visualize the first univariate time series\n",
"X_train.iloc[0, 0].plot()"
]
@@ -177,7 +177,7 @@
 },
 "outputs": [],
 "source": [
-"X_test, y_test = load_arrow_head(split=\"test\", return_X_y=True)\n",
+"X_test, y_test = load_arrow_head(split=\"test\")\n",
 "X_test_transform = minirocket.transform(X_test)"
 ]
 },
@@ -208,10 +208,7 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
 "source": [
-"***\n",
-"\n",
 "## 2 Multivariate Time Series\n",
 "\n",
 "We can use the multivariate version of MiniRocket for multivariate time series input.\n",
@@ -220,15 +217,11 @@
 "\n",
 "Import MiniRocketMultivariate.\n",
 "\n",
-"**Note**: MiniRocketMultivariate compiles via Numba on import."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
+"**Note**: MiniRocketMultivariate compiles via Numba on import.\n"
+],
+"metadata": {
+"collapsed": false
+}
 },
 {
 "cell_type": "markdown",
Expand All @@ -252,7 +245,7 @@
},
"outputs": [],
"source": [
"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)"
"X_train, y_train = load_basic_motions(split=\"train\")"
]
},
{
@@ -327,7 +320,7 @@
 },
 "outputs": [],
 "source": [
-"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
+"X_test, y_test = load_basic_motions(split=\"test\")\n",
 "X_test_transform = minirocket_multi.transform(X_test)"
 ]
 },
@@ -471,8 +464,8 @@
 "\n",
 "\n",
 "### 4.1 Load japanese_vowels as unequal length dataset\n",
-"Japanese vowels is a UCI Archive dataset. 9 Japanese male speakers were recorded saying the vowels ‘a’ and ‘e’. \n",
-"The raw recordings are preprocessed to get a 12-dimensional (multivariate) classification problem. The series lengths are between 7 and 29. "
+"Japanese vowels is a UCI Archive dataset. 9 Japanese male speakers were recorded saying the vowels ‘a’ and ‘e’.\n",
+"The raw recordings are preprocessed to get a 12-dimensional (multivariate) classification problem. The series lengths are between 7 and 29."
 ]
 },
 {
@@ -509,7 +502,7 @@
 "metadata": {},
 "source": [
 "### 4.2 Create a pipeline, train on it\n",
-"As before, we create a sklearn pipeline. \n",
+"As before, we create a sklearn pipeline.\n",
 "MiniRocketMultivariateVariable requires a minimum series length of 9; shorter series are padded up to length 9 with the value \"-10.0\".\n",
 "Afterwards a scaler and a RidgeClassifierCV are added.\n"
 ]
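
For illustration, a hedged sketch of the pipeline section 4.2 describes (not part of the diff; parameter names such as `pad_value_short_series` and the alpha grid are assumptions in the spirit of the period's sktime API):

    import numpy as np
    from sklearn.linear_model import RidgeClassifierCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sktime.datasets import load_japanese_vowels
    from sktime.transformations.panel.rocket import MiniRocketMultivariateVariable

    X_train, y_train = load_japanese_vowels(split="train", return_X_y=True)

    pipe = make_pipeline(
        # pad series shorter than the required minimum length of 9 with -10.0
        MiniRocketMultivariateVariable(pad_value_short_series=-10.0, random_state=42),
        StandardScaler(with_mean=False),  # scale the rocket features
        RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),
    )
    pipe.fit(X_train, y_train)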
4 changes: 2 additions & 2 deletions sktime/benchmarking/tests/test_TSCStrategy.py
@@ -16,8 +16,8 @@
 @pytest.mark.parametrize("dataset", DATASET_LOADERS)
 def test_TSCStrategy(dataset):
     """Test strategy."""
-    train = dataset(split="train", return_X_y=False)
-    test = dataset(split="test", return_X_y=False)
+    train = dataset(split="train", return_X_y=False, return_type="nested_univ")
+    test = dataset(split="test", return_X_y=False, return_type="nested_univ")
     s = TSCStrategy(classifier)
     task = TSCTask(target="class_val")
     s.fit(task, train)
9 changes: 6 additions & 3 deletions sktime/benchmarking/tests/test_orchestration.py
@@ -41,7 +41,7 @@ def make_reduction_pipeline(estimator):
 @pytest.mark.parametrize("data_loader", [load_gunpoint, load_arrow_head])
 def test_automated_orchestration_vs_manual(data_loader):
     """Test orchestration."""
-    data = data_loader(return_X_y=False)
+    data = data_loader(return_X_y=False, return_type="nested_univ")
 
     dataset = RAMDataset(dataset=data, name="data")
     task = TSCTask(target="class_val")
@@ -85,7 +85,10 @@ def test_automated_orchestration_vs_manual(data_loader):
 @pytest.mark.parametrize(
     "dataset",
     [
-        RAMDataset(dataset=load_arrow_head(return_X_y=False), name="ArrowHead"),
+        RAMDataset(
+            dataset=load_arrow_head(return_X_y=False, return_type="nested_univ"),
+            name="ArrowHead",
+        ),
         UEADataset(path=DATAPATH, name="GunPoint", target_name="class_val"),
     ],
 )
@@ -161,7 +164,7 @@ def test_single_dataset_single_strategy_against_sklearn(
 # simple test of sign test and ranks
 def test_stat():
     """Test sign ranks."""
-    data = load_gunpoint(split="train", return_X_y=False)
+    data = load_gunpoint(split="train", return_X_y=False, return_type="nested_univ")
     dataset = RAMDataset(dataset=data, name="gunpoint")
     task = TSCTask(target="class_val")
 
2 changes: 1 addition & 1 deletion sktime/benchmarking/tests/test_tasks.py
@@ -11,7 +11,7 @@
 
 TASKS = (TSCTask, TSRTask)
 
-gunpoint = load_gunpoint(return_X_y=False)
+gunpoint = load_gunpoint(return_X_y=False, return_type="nested_univ")
 shampoo_sales = load_shampoo_sales()
 
 BASE_READONLY_ATTRS = ("target", "features", "metadata")
@@ -44,8 +44,8 @@
 def test_knn_on_unit_test(distance_key):
     """Test function for elastic knn, to be reinstated soon."""
     # load arrowhead data for unit tests
-    X_train, y_train = load_unit_test(split="train", return_X_y=True)
-    X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    X_train, y_train = load_unit_test(split="train")
+    X_test, y_test = load_unit_test(split="test")
     knn = KNeighborsTimeSeriesClassifier(
         distance=distance_key,
     )
@@ -61,8 +61,8 @@ def test_knn_on_unit_test(distance_key):
 @pytest.mark.parametrize("distance_key", distance_functions)
 def test_knn_bounding_matrix(distance_key):
     """Test knn with custom bounding parameters."""
-    X_train, y_train = load_unit_test(split="train", return_X_y=True)
-    X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    X_train, y_train = load_unit_test(split="train")
+    X_test, y_test = load_unit_test(split="test")
     knn = KNeighborsTimeSeriesClassifier(
         distance=distance_key, distance_params={"window": 0.5}
     )
@@ -8,14 +8,13 @@
 )
 from sktime.classification.interval_based import TimeSeriesForestClassifier
 from sktime.datasets import load_unit_test
-from sktime.datatypes._panel._convert import from_nested_to_3d_numpy
 
 
 def test_prob_threshold_on_unit_test_data():
     """Test of ProbabilityThresholdEarlyClassifier on unit test data."""
     # load unit test data
-    X_train, y_train = load_unit_test(split="train", return_X_y=True)
-    X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    X_train, y_train = load_unit_test(split="train")
+    X_test, y_test = load_unit_test(split="test")
     indices = np.random.RandomState(0).choice(len(y_train), 10, replace=False)
 
     # train probability threshold
@@ -30,7 +29,6 @@ def test_prob_threshold_on_unit_test_data():
     final_probas = np.zeros((10, 2))
     final_decisions = np.zeros(10)
 
-    X_test = from_nested_to_3d_numpy(X_test)
     states = None
     for i in pt.classification_points:
         X = X_test[indices, :, :i]