From 021bf682841dd6e0a57cc9c2d8f52afc9bf8e57d Mon Sep 17 00:00:00 2001
From: Balaji Veeramani
Date: Thu, 14 Mar 2024 11:31:10 -0700
Subject: [PATCH] Remove 'Using Preprocessors'

Signed-off-by: Balaji Veeramani
---
 doc/source/data/preprocessors.rst             | 223 ------------------
 doc/source/data/user-guide.rst                |   1 -
 .../train/distributed-xgboost-lightgbm.rst    |   2 +-
 .../data-loading-preprocessing.rst            |   2 +-
 4 files changed, 2 insertions(+), 226 deletions(-)
 delete mode 100644 doc/source/data/preprocessors.rst

diff --git a/doc/source/data/preprocessors.rst b/doc/source/data/preprocessors.rst
deleted file mode 100644
index 13730d103071..000000000000
--- a/doc/source/data/preprocessors.rst
+++ /dev/null
@@ -1,223 +0,0 @@
-.. _data-preprocessors:
-
-Using Preprocessors
-===================
-
-Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
-In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
-
-This page covers *preprocessors*, which are a higher-level API on top of existing Ray Data operations like ``map_batches``,
-targeted towards tabular and structured data use cases.
-
-If you are working with tabular data, you should use Ray Data preprocessors. However, the recommended way to perform preprocessing
-for unstructured data is to :ref:`use existing Ray Data operations ` instead of preprocessors.
-
-
-.. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit
-
-.. image:: images/preprocessors.svg
-
-
-Overview
---------
-
-The :class:`Preprocessor ` class has four public methods:
-
-#. :meth:`fit() `: Compute state information about a :class:`Dataset ` (for example, the mean or standard deviation of a column)
-   and save it to the :class:`Preprocessor `. This information is used to perform :meth:`transform() `, and the method is typically called on a
-   training dataset.
-#. :meth:`transform() `: Apply a transformation to a :class:`Dataset `.
-   If the :class:`Preprocessor ` is stateful, then :meth:`fit() ` must be called first. This method is typically called on training,
-   validation, and test datasets.
-#. :meth:`transform_batch() `: Apply a transformation to a single :class:`batch ` of data. This method is typically called on online or offline inference data.
-#. :meth:`fit_transform() `: Syntactic sugar for calling both :meth:`fit() ` and :meth:`transform() ` on a :class:`Dataset `.
-
-To show these methods in action, let's walk through a basic example. First, set up two simple Ray ``Dataset``\s.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __preprocessor_setup_start__
-    :end-before: __preprocessor_setup_end__
-
-Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __preprocessor_fit_transform_start__
-    :end-before: __preprocessor_fit_transform_end__
-
-Finally, call ``transform_batch`` on a single batch of data.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __preprocessor_transform_batch_start__
-    :end-before: __preprocessor_transform_batch_end__
-
-The most common way to use a preprocessor is to apply it to a :ref:`Ray Data dataset `, which you then pass to a Ray Train :ref:`Trainer `. See also:
-
-* Ray Train's data preprocessing and ingest section for :ref:`PyTorch `
-* Ray Train's data preprocessing and ingest section for :ref:`LightGBM/XGBoost `
-
-Types of preprocessors
-----------------------
-
-Built-in preprocessors
-~~~~~~~~~~~~~~~~~~~~~~
-
-Ray Data provides a handful of preprocessors out of the box.
-
-**Generic preprocessors**
-
-.. autosummary::
-    :nosignatures:
-
-    ray.data.preprocessors.Concatenator
-    ray.data.preprocessor.Preprocessor
-    ray.data.preprocessors.SimpleImputer
-
-**Categorical encoders**
-
-.. autosummary::
-    :nosignatures:

-    ray.data.preprocessors.Categorizer
-    ray.data.preprocessors.LabelEncoder
-    ray.data.preprocessors.MultiHotEncoder
-    ray.data.preprocessors.OneHotEncoder
-    ray.data.preprocessors.OrdinalEncoder
-
-**Feature scalers**
-
-.. autosummary::
-    :nosignatures:
-
-    ray.data.preprocessors.MaxAbsScaler
-    ray.data.preprocessors.MinMaxScaler
-    ray.data.preprocessors.Normalizer
-    ray.data.preprocessors.PowerTransformer
-    ray.data.preprocessors.RobustScaler
-    ray.data.preprocessors.StandardScaler
-
-**Utilities**
-
-.. autosummary::
-    :nosignatures:
-
-    ray.data.Dataset.train_test_split
-
-Which preprocessor should you use?
-----------------------------------
-
-The type of preprocessor you use depends on what your data looks like. This section
-provides tips on handling common data formats.
-
-Categorical data
-~~~~~~~~~~~~~~~~
-
-Most models expect numerical inputs. To represent your categorical data in a way your
-model can understand, encode categories using one of the preprocessors described below.
-
-.. list-table::
-    :header-rows: 1
-
-    * - Categorical Data Type
-      - Example
-      - Preprocessor
-    * - Labels
-      - ``"cat"``, ``"dog"``, ``"airplane"``
-      - :class:`~ray.data.preprocessors.LabelEncoder`
-    * - Ordered categories
-      - ``"bs"``, ``"md"``, ``"phd"``
-      - :class:`~ray.data.preprocessors.OrdinalEncoder`
-    * - Unordered categories
-      - ``"red"``, ``"green"``, ``"blue"``
-      - :class:`~ray.data.preprocessors.OneHotEncoder`
-    * - Lists of categories
-      - ``("sci-fi", "action")``, ``("action", "comedy", "animated")``
-      - :class:`~ray.data.preprocessors.MultiHotEncoder`
-
-.. note::
-    If you're using LightGBM, you don't need to encode your categorical data. Instead,
-    use :class:`~ray.data.preprocessors.Categorizer` to convert your data to
-    `pandas.CategoricalDtype`.
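The table above maps each categorical shape to an encoder. As a rough plain-Python sketch of what ordinal and one-hot encoding compute (the `fit_categories`, `ordinal_encode`, and `one_hot_encode` functions are illustrative stand-ins, not the Ray implementations, which operate on Ray Datasets):

```python
def fit_categories(values):
    # "fit": learn the sorted set of categories from training data.
    return sorted(set(values))

def ordinal_encode(categories, values):
    # Each category maps to its index in the fitted category list.
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]

def one_hot_encode(categories, values):
    # Each value becomes a 0/1 vector with a single 1 at its category's index.
    return [[1 if v == c else 0 for c in categories] for v in values]

categories = fit_categories(["red", "green", "blue", "green"])
# categories == ["blue", "green", "red"]
ordinal_encode(categories, ["red", "blue"])  # [2, 0]
one_hot_encode(categories, ["red"])          # [[0, 0, 1]]
```

As in the real API, the categories are learned once from training data and then reused, so the same category always maps to the same encoding at inference time.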
-
-Numerical data
-~~~~~~~~~~~~~~
-
-To ensure your model behaves properly, normalize your numerical data. Reference the
-table below to determine which preprocessor to use.
-
-.. list-table::
-    :header-rows: 1
-
-    * - Data Property
-      - Preprocessor
-    * - Your data is approximately normal
-      - :class:`~ray.data.preprocessors.StandardScaler`
-    * - Your data is sparse
-      - :class:`~ray.data.preprocessors.MaxAbsScaler`
-    * - Your data contains many outliers
-      - :class:`~ray.data.preprocessors.RobustScaler`
-    * - Your data isn't normal, but you need it to be
-      - :class:`~ray.data.preprocessors.PowerTransformer`
-    * - You need unit-norm rows
-      - :class:`~ray.data.preprocessors.Normalizer`
-    * - You aren't sure what your data looks like
-      - :class:`~ray.data.preprocessors.MinMaxScaler`
-
-.. warning::
-    These preprocessors operate on numeric columns. If your dataset contains columns of
-    type :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`, you may need to
-    :ref:`implement a custom preprocessor `.
-
-Additionally, if your model expects a tensor or ``ndarray``, create a tensor using
-:class:`~ray.data.preprocessors.Concatenator`.
-
-.. tip::
-    Built-in feature scalers like :class:`~ray.data.preprocessors.StandardScaler` don't
-    work on :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype` columns, so apply
-    :class:`~ray.data.preprocessors.Concatenator` after feature scaling.
-
-    .. literalinclude:: doc_code/preprocessors.py
-        :language: python
-        :start-after: __concatenate_start__
-        :end-before: __concatenate_end__
-
-
-Filling in missing values
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If your dataset contains missing values, replace them with
-:class:`~ray.data.preprocessors.SimpleImputer`.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __simple_imputer_start__
-    :end-before: __simple_imputer_end__
-
-
-Chaining preprocessors
-~~~~~~~~~~~~~~~~~~~~~~
-
-If you need to apply more than one preprocessor, apply them in sequence on your dataset.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __chain_start__
-    :end-before: __chain_end__
-
-
-.. _air-custom-preprocessors:
-
-Implementing custom preprocessors
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you want to implement a custom preprocessor that needs to be fit, extend the
-:class:`~ray.data.preprocessor.Preprocessor` base class.
-
-.. literalinclude:: doc_code/preprocessors.py
-    :language: python
-    :start-after: __custom_stateful_start__
-    :end-before: __custom_stateful_end__
-
-If your preprocessor doesn't need to be fit, use :meth:`map_batches() ` to directly transform your dataset. For more details, see :ref:`Transforming Data `.
diff --git a/doc/source/data/user-guide.rst b/doc/source/data/user-guide.rst
index 0eeda1296761..7c43699ffbf1 100644
--- a/doc/source/data/user-guide.rst
+++ b/doc/source/data/user-guide.rst
@@ -23,6 +23,5 @@ show you how to achieve several tasks.
     working-with-pytorch
     batch_inference
     performance-tips
-    preprocessors
     monitoring-your-workload
     custom-datasource-example
\ No newline at end of file
diff --git a/doc/source/train/distributed-xgboost-lightgbm.rst b/doc/source/train/distributed-xgboost-lightgbm.rst
index 6361a055f8b5..9c7dcf06a33d 100644
--- a/doc/source/train/distributed-xgboost-lightgbm.rst
+++ b/doc/source/train/distributed-xgboost-lightgbm.rst
@@ -204,7 +204,7 @@ machines have 16 CPUs in addition to the 4 GPUs, each actor should have
 How to preprocess data for training?
 ------------------------------------
 
-Particularly for tabular data, Ray Data comes with out-of-the-box :ref:`preprocessors ` that implement common feature preprocessing operations.
+Particularly for tabular data, Ray Data comes with out-of-the-box :ref:`preprocessors ` that implement common feature preprocessing operations.
 You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer.
 
 For example:
 
diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index 7c86c86c37ab..4f0860c3efdc 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -510,7 +510,7 @@ Preprocessing structured data
 
 This section is for tabular/structured data. The recommended way for preprocessing unstructured data is to use Ray Data operations such as `map_batches`. See the :ref:`Ray Data Working with Pytorch guide ` for more details.
 
-For tabular data, we recommend using Ray Data :ref:`preprocessors `, which implement common data preprocessing operations.
+For tabular data, we recommend using Ray Data :ref:`preprocessors `, which implement common data preprocessing operations.
 You can use these with Ray Train Trainers by applying them to the dataset before passing it to a Trainer. For example:
 
 .. testcode::
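The flow both of the passages above rely on — fit a preprocessor on the training dataset, then transform every dataset with the fitted state before handing the data to a Trainer — can be sketched in plain Python. The `MinMaxSketch` class and the column name `"x"` below are illustrative stand-ins, not Ray APIs:

```python
class MinMaxSketch:
    """Scales a numeric column to [0, 1] using bounds learned at fit time."""

    def __init__(self, column):
        self.column = column
        self.low = self.high = None

    def fit(self, rows):
        # Learn the column's min and max from the *training* data only.
        values = [row[self.column] for row in rows]
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, rows):
        # Apply the fitted bounds; works on any dataset once fit() has run.
        span = (self.high - self.low) or 1.0
        return [
            {**row, self.column: (row[self.column] - self.low) / span}
            for row in rows
        ]

    def fit_transform(self, rows):
        # Syntactic sugar for fit() followed by transform().
        return self.fit(rows).transform(rows)

train_rows = [{"x": 0.0}, {"x": 5.0}, {"x": 10.0}]
valid_rows = [{"x": 2.5}]

prep = MinMaxSketch("x")
train_scaled = prep.fit_transform(train_rows)  # x -> 0.0, 0.5, 1.0
valid_scaled = prep.transform(valid_rows)      # uses training bounds: 0.25
```

The key property is that the validation data is scaled with the bounds learned from the training data, which is the same reason the docs above say to fit once and then transform each dataset before it reaches a Trainer.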