From 021bf682841dd6e0a57cc9c2d8f52afc9bf8e57d Mon Sep 17 00:00:00 2001
From: Balaji Veeramani
Date: Thu, 14 Mar 2024 11:31:10 -0700
Subject: [PATCH] Remove 'Using Preprocessors'

Signed-off-by: Balaji Veeramani
---
 doc/source/data/preprocessors.rst             | 223 ------------------
 doc/source/data/user-guide.rst                |   1 -
 .../train/distributed-xgboost-lightgbm.rst    |   2 +-
 .../data-loading-preprocessing.rst            |   2 +-
 4 files changed, 2 insertions(+), 226 deletions(-)
 delete mode 100644 doc/source/data/preprocessors.rst

diff --git a/doc/source/data/preprocessors.rst b/doc/source/data/preprocessors.rst
deleted file mode 100644
index 13730d103071..000000000000
--- a/doc/source/data/preprocessors.rst
+++ /dev/null
@@ -1,223 +0,0 @@
-.. _data-preprocessors:
-
-Using Preprocessors
-===================
-
-Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
-In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
-
-This page covers *preprocessors*, which are a higher-level API on top of existing Ray Data operations like ``map_batches``,
-targeted towards tabular and structured data use cases.
-
-If you are working with tabular data, you should use Ray Data preprocessors. However, the recommended way to perform preprocessing
-for unstructured data is to :ref:`use existing Ray Data operations ` instead of preprocessors.
-
-
-.. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit
-
-.. image:: images/preprocessors.svg
-
-
-Overview
---------
-
-The :class:`Preprocessor ` class has four public methods:
-
-#. :meth:`fit() `: Compute state information about a :class:`Dataset ` (for example, the mean or standard deviation of a column)
-   and save it to the :class:`Preprocessor `. This information is used to perform :meth:`transform() `, and the method is typically called on a
-   training dataset.
-#. :meth:`transform() `: Apply a transformation to a :class:`Dataset `.
-   If the :class:`Preprocessor ` is stateful, then :meth:`fit() ` must be called first. This method is typically called on training,
-   validation, and test datasets.
-#. :meth:`transform_batch() `: Apply a transformation to a single :class:`batch ` of data. This method is typically called on online or offline inference data.
-#. :meth:`fit_transform() `: Syntactic sugar for calling both :meth:`fit() ` and :meth:`transform() ` on a :class:`Dataset `.
-
-To show these methods in action, let's walk through a basic example. First, set up two simple Ray ``Dataset``\s.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __preprocessor_setup_start__
-    :end-before: __preprocessor_setup_end__
-
-Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __preprocessor_fit_transform_start__
-    :end-before: __preprocessor_fit_transform_end__
-
-Finally, call ``transform_batch`` on a single batch of data.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __preprocessor_transform_batch_start__
-    :end-before: __preprocessor_transform_batch_end__
-
-The most common way to use a preprocessor is to apply it to a :ref:`Ray Data dataset `, which you then pass to a Ray Train :ref:`Trainer `. See also:
-
-* Ray Train's data preprocessing and ingest section for :ref:`PyTorch `
-* Ray Train's data preprocessing and ingest section for :ref:`LightGBM/XGBoost `
-
-Types of preprocessors
-----------------------
-
-Built-in preprocessors
-~~~~~~~~~~~~~~~~~~~~~~
-
-Ray Data provides a handful of preprocessors out of the box.
-
-**Generic preprocessors**
-
-.. autosummary::
-    :nosignatures:
-
-    ray.data.preprocessors.Concatenator
-    ray.data.preprocessor.Preprocessor
-    ray.data.preprocessors.SimpleImputer
-
-**Categorical encoders**
-
-.. autosummary::
-    :nosignatures:

-    ray.data.preprocessors.Categorizer
-    ray.data.preprocessors.LabelEncoder
-    ray.data.preprocessors.MultiHotEncoder
-    ray.data.preprocessors.OneHotEncoder
-    ray.data.preprocessors.OrdinalEncoder
-
-**Feature scalers**
-
-.. autosummary::
-    :nosignatures:
-
-    ray.data.preprocessors.MaxAbsScaler
-    ray.data.preprocessors.MinMaxScaler
-    ray.data.preprocessors.Normalizer
-    ray.data.preprocessors.PowerTransformer
-    ray.data.preprocessors.RobustScaler
-    ray.data.preprocessors.StandardScaler
-
-**Utilities**
-
-.. autosummary::
-    :nosignatures:
-
-    ray.data.Dataset.train_test_split
-
-Which preprocessor should you use?
-----------------------------------
-
-The type of preprocessor you use depends on what your data looks like. This section
-provides tips on handling common data formats.
-
-Categorical data
-~~~~~~~~~~~~~~~~
-
-Most models expect numerical inputs. To represent your categorical data in a way your
-model can understand, encode categories using one of the preprocessors described below.
-
-.. list-table::
-    :header-rows: 1
-
-    * - Categorical Data Type
-      - Example
-      - Preprocessor
-    * - Labels
-      - ``"cat"``, ``"dog"``, ``"airplane"``
-      - :class:`~ray.data.preprocessors.LabelEncoder`
-    * - Ordered categories
-      - ``"bs"``, ``"md"``, ``"phd"``
-      - :class:`~ray.data.preprocessors.OrdinalEncoder`
-    * - Unordered categories
-      - ``"red"``, ``"green"``, ``"blue"``
-      - :class:`~ray.data.preprocessors.OneHotEncoder`
-    * - Lists of categories
-      - ``("sci-fi", "action")``, ``("action", "comedy", "animated")``
-      - :class:`~ray.data.preprocessors.MultiHotEncoder`
-
-.. note::
-    If you're using LightGBM, you don't need to encode your categorical data. Instead,
-    use :class:`~ray.data.preprocessors.Categorizer` to convert your data to
-    `pandas.CategoricalDtype`.
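The table above maps each categorical shape to an encoder. As a rough plain-Python sketch of what ordinal and one-hot encoding compute (the `fit_categories`, `ordinal_encode`, and `one_hot_encode` functions are illustrative stand-ins, not the Ray implementations, which operate on Ray Datasets):

```python
def fit_categories(values):
    # "fit": learn the sorted set of categories from training data.
    return sorted(set(values))

def ordinal_encode(categories, values):
    # Each category maps to its index in the fitted category list.
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]

def one_hot_encode(categories, values):
    # Each value becomes a 0/1 vector with a single 1 at its category's index.
    return [[1 if v == c else 0 for c in categories] for v in values]

categories = fit_categories(["red", "green", "blue", "green"])
# categories == ["blue", "green", "red"]
ordinal_encode(categories, ["red", "blue"])  # [2, 0]
one_hot_encode(categories, ["red"])          # [[0, 0, 1]]
```

As in the real API, the categories are learned once from training data and then reused, so the same category always maps to the same encoding at inference time.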
-
-Numerical data
-~~~~~~~~~~~~~~
-
-To ensure your model behaves properly, normalize your numerical data. Reference the
-table below to determine which preprocessor to use.
-
-.. list-table::
-    :header-rows: 1
-
-    * - Data Property
-      - Preprocessor
-    * - Your data is approximately normal
-      - :class:`~ray.data.preprocessors.StandardScaler`
-    * - Your data is sparse
-      - :class:`~ray.data.preprocessors.MaxAbsScaler`
-    * - Your data contains many outliers
-      - :class:`~ray.data.preprocessors.RobustScaler`
-    * - Your data isn't normal, but you need it to be
-      - :class:`~ray.data.preprocessors.PowerTransformer`
-    * - You need unit-norm rows
-      - :class:`~ray.data.preprocessors.Normalizer`
-    * - You aren't sure what your data looks like
-      - :class:`~ray.data.preprocessors.MinMaxScaler`
-
-.. warning::
-    These preprocessors operate on numeric columns. If your dataset contains columns of
-    type :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`, you may need to
-    :ref:`implement a custom preprocessor `.
-
-Additionally, if your model expects a tensor or ``ndarray``, create a tensor using
-:class:`~ray.data.preprocessors.Concatenator`.
-
-.. tip::
-    Built-in feature scalers like :class:`~ray.data.preprocessors.StandardScaler` don't
-    work on :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype` columns, so apply
-    :class:`~ray.data.preprocessors.Concatenator` after feature scaling.
-
-    .. literalinclude:: doc_code/preprocessors.py
-        :language: python
-        :start-after: __concatenate_start__
-        :end-before: __concatenate_end__
-
-
-Filling in missing values
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If your dataset contains missing values, replace them with
-:class:`~ray.data.preprocessors.SimpleImputer`.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __simple_imputer_start__
-    :end-before: __simple_imputer_end__
-
-
-Chaining preprocessors
-~~~~~~~~~~~~~~~~~~~~~~
-
-If you need to apply more than one preprocessor, apply them in sequence on your dataset.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __chain_start__
-    :end-before: __chain_end__
-
-
-.. _air-custom-preprocessors:
-
-Implementing custom preprocessors
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you want to implement a custom preprocessor that needs to be fit, extend the
-:class:`~ray.data.preprocessor.Preprocessor` base class.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __custom_stateful_start__
-    :end-before: __custom_stateful_end__
-
-If your preprocessor doesn't need to be fit, use :meth:`map_batches() ` to directly transform your dataset. For more details, see :ref:`Transforming Data `.
diff --git a/doc/source/data/user-guide.rst b/doc/source/data/user-guide.rst
index 0eeda1296761..7c43699ffbf1 100644
--- a/doc/source/data/user-guide.rst
+++ b/doc/source/data/user-guide.rst
@@ -23,6 +23,5 @@ show you how to achieve several tasks.
     working-with-pytorch
     batch_inference
     performance-tips
-    preprocessors
     monitoring-your-workload
     custom-datasource-example
\ No newline at end of file
diff --git a/doc/source/train/distributed-xgboost-lightgbm.rst b/doc/source/train/distributed-xgboost-lightgbm.rst
index 6361a055f8b5..9c7dcf06a33d 100644
--- a/doc/source/train/distributed-xgboost-lightgbm.rst
+++ b/doc/source/train/distributed-xgboost-lightgbm.rst
@@ -204,7 +204,7 @@ machines have 16 CPUs in addition to the 4 GPUs, each actor should have
 How to preprocess data for training?
 ------------------------------------
 
-Particularly for tabular data, Ray Data comes with out-of-the-box :ref:`preprocessors ` that implement common feature preprocessing operations.
+Particularly for tabular data, Ray Data comes with out-of-the-box :ref:`preprocessors ` that implement common feature preprocessing operations.
 You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer.
 
 For example:
 
diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index 7c86c86c37ab..4f0860c3efdc 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -510,7 +510,7 @@ Preprocessing structured data
 
 This section is for tabular/structured data. The recommended way for preprocessing unstructured data is to use Ray Data operations such as `map_batches`. See the :ref:`Ray Data Working with Pytorch guide ` for more details.
 
-For tabular data, we recommend using Ray Data :ref:`preprocessors `, which implement common data preprocessing operations.
+For tabular data, we recommend using Ray Data :ref:`preprocessors `, which implement common data preprocessing operations.
 You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer. For example:
 
 .. testcode::
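The flow both of the passages above rely on — fit a preprocessor on the training dataset, then transform every dataset with the fitted state before handing the data to a Trainer — can be sketched in plain Python. The `MinMaxSketch` class and the column name `"x"` below are illustrative stand-ins, not Ray APIs:

```python
class MinMaxSketch:
    """Scales a numeric column to [0, 1] using bounds learned at fit time."""

    def __init__(self, column):
        self.column = column
        self.low = self.high = None

    def fit(self, rows):
        # Learn the column's min and max from the *training* data only.
        values = [row[self.column] for row in rows]
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, rows):
        # Apply the fitted bounds; works on any dataset once fit() has run.
        span = (self.high - self.low) or 1.0
        return [
            {**row, self.column: (row[self.column] - self.low) / span}
            for row in rows
        ]

    def fit_transform(self, rows):
        # Syntactic sugar for fit() followed by transform().
        return self.fit(rows).transform(rows)

train_rows = [{"x": 0.0}, {"x": 5.0}, {"x": 10.0}]
valid_rows = [{"x": 2.5}]

prep = MinMaxSketch("x")
train_scaled = prep.fit_transform(train_rows)  # x -> 0.0, 0.5, 1.0
valid_scaled = prep.transform(valid_rows)      # uses training bounds: 0.25
```

The key property is that the validation data is scaled with the bounds learned from the training data, which is the same reason the docs above say to fit once and then transform each dataset before it reaches a Trainer.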