
[AIR] Add option for per-epoch preprocessor #31739

Merged

Conversation

stephanie-wang (Contributor)

Why are these changes needed?

This adds an option to the AIR DatasetConfig for a preprocessor that gets reapplied on each epoch. Currently the implementation uses DatasetPipeline to ensure that the extra preprocessing step is overlapped with training.
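As a minimal sketch of how the new option might be wired up (the per_epoch_preprocessor field is the one added by this PR; the add_noise preprocessor and the trainer wiring in the comment are illustrative assumptions, not code from this PR):

import random

from ray.air.config import DatasetConfig
from ray.data.preprocessors import BatchMapper

# Reapplied on every epoch, after the trainer's standard one-time preprocessor.
add_noise = BatchMapper(lambda df: df + random.random(), batch_format="pandas")

dataset_config = {
    "train": DatasetConfig(per_epoch_preprocessor=add_noise),
}
# dataset_config would then be passed to an AIR trainer,
# e.g. TorchTrainer(..., dataset_config=dataset_config).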

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

However, in some cases you may want to reapply a preprocessor on each epoch, for example to augment your training dataset with a randomized transform.

To support this use case, AIR offers an additional *per-epoch preprocessor* that gets reapplied on each epoch, after all other preprocessors and right before dataset consumption (e.g., using :meth:`~ray.data.DatasetIterator.iter_batches()`).
Per-epoch preprocessing also executes in parallel with dataset consumption, reducing pauses during consumption.
Contributor:

Do we also fit() the per-epoch preprocessor?

Contributor (Author):

Actually, I wasn't sure about this part because I don't really understand how fit() works...

  • When do we need to call fit()?
  • If the standard preprocessor is defined, do we need to fit() on the preprocessed dataset or the input dataset?

@ericl (Contributor), Jan 18, 2023:

Right now, we call fit() on start (actually fit_transform(), I believe, to create the original preprocessed dataset). A fittable preprocessor isn't usable for transformation until it is fitted.

> if the standard preprocessor is defined, do we need to fit() on the preprocessed dataset or the input dataset?

Hmm, I'd think you would fit the preprocessed dataset, since this preprocessor is logically consuming the output of the previous one-time preprocessor.

Perhaps we should just raise ValueError if the per-epoch preprocessor requires fitting?

Contributor (Author):

That's a good idea, thanks!

Contributor:

+1 on not allowing fittable per-epoch preprocessors

Contributor:

Yep +1 on not allowing fittable per-epoch preprocessors
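As a rough illustration of the check the thread converges on (a sketch assuming the Preprocessor.fit_status()/FitStatus API; the function name is hypothetical and this is not the PR's actual code):

from ray.data.preprocessor import Preprocessor

def _validate_per_epoch_preprocessor(per_epoch_preprocessor: Preprocessor) -> None:
    # A per-epoch preprocessor is reapplied on every epoch and is never fit,
    # so anything that requires fitting is rejected up front.
    if per_epoch_preprocessor.fit_status() != Preprocessor.FitStatus.NOT_FITTABLE:
        raise ValueError(
            "Per-epoch preprocessors must not require fitting; "
            f"got {per_epoch_preprocessor}."
        )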

# A randomized preprocessor that adds a random float to all values, to be
# reapplied on each epoch after `preprocessor`. Each epoch will therefore add a
# different random float to the scaled dataset.
rand_preprocessor = BatchMapper(lambda df: df + random.random(), batch_format="pandas")
Contributor:

Shall we call this add_noise or something to avoid overloading the term "random" too many times in this example?

The standard preprocessor passed to the ``Trainer`` is only applied once to the initial dataset when using :ref:`bulk ingest <air-streaming-ingest>`.
However, in some cases you may want to reapply a preprocessor on each epoch, for example to augment your training dataset with a randomized transform.

To support this use case, AIR offers an additional *per-epoch preprocessor* that gets reapplied on each epoch, after all other preprocessors and right before dataset consumption (e.g., using :meth:`~ray.data.DatasetIterator.iter_batches()`).
Contributor:

Is the "executes in parallel" part only true for the pipelining-enabled version?

@stephanie-wang (Contributor, Author), Jan 18, 2023:

I'm using DatasetPipeline under the hood so actually it is always true. I figure this is OK since the implementation detail is now hidden under DatasetIterator and the feature is experimental anyway. Long-term, I imagine we want to switch to the fully pipelined Datasets backend or we cache the preprocessed dataset and run the per-epoch preprocessing on the pipelined backend.
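Roughly, the approach described here boils down to the following sketch (illustrative, not the merged code; the dataset contents and add_noise preprocessor are made up):

import random

import ray
from ray.data.preprocessors import BatchMapper

ds = ray.data.from_items([{"value": float(i)} for i in range(100)])
add_noise = BatchMapper(lambda df: df + random.random(), batch_format="pandas")

# repeat() turns the Dataset into a DatasetPipeline with one window per epoch;
# map_batches() is then re-executed lazily for every window, so the per-epoch
# transform of the next epoch can overlap with consumption of the current one.
pipe = ds.repeat(2)
pipe = pipe.map_batches(add_noise.transform_batch)

for epoch_ds in pipe.iter_epochs():
    for batch in epoch_ds.iter_batches(batch_size=32):
        pass  # a training step would consume `batch` here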

@ericl added the @author-action-required label on Jan 18, 2023.
@stephanie-wang removed the @author-action-required label on Jan 18, 2023.
def multiply(x):
    return x * 2

for max_object_store_memory_fraction in [None, 1, 0.3]:
@amogkam (Contributor), Jan 18, 2023:

should we use @pytest.mark.parametrize for this so that it will be easier to identify which case fails?
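A quick sketch of the suggested parametrization (the test name and body are illustrative, not the PR's actual test):

import pytest

@pytest.mark.parametrize("max_object_store_memory_fraction", [None, 1, 0.3])
def test_per_epoch_preprocessing(max_object_store_memory_fraction):
    # Each fraction now runs as a separate test case, so a failure report
    # identifies exactly which setting broke.
    ...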

@amogkam (Contributor) left a comment:

Thanks @swang! LGTM -- just left one comment on using pytest for parametrization rather than doing it manually.

@bveeramani (Member) left a comment:

LGTM! This'll be a super useful addition

dataset = dataset.repeat()
# TODO: Replace with preprocessor.transform when possible.
per_epoch_prep = config.per_epoch_preprocessor.transform_batch
dataset = dataset.map_batches(per_epoch_prep)
Member:

Not sure if I'm just being dumb here, but doesn't this apply the per-epoch preprocessor twice on the first epoch? Like, if config.per_epoch_preprocessor isn't None, then we apply it on both line 204 and line 219?

Contributor:

That's a good point... I think lines 218-219 need to be moved inside the if statement.

Contributor (Author):

Ah thanks, you're right: it should be under an elif. I'll add a test to make sure we're only applying it once.

},
{"train": ds},
)

Member:

Nit: could you add a comment here, and maybe on line 487, describing why DatasetConfig raises a ValueError? It wasn't obvious to me from reading the test why Preprocessor() is invalid.

@ericl added the @author-action-required label on Jan 19, 2023.
# Reapply the per epoch preprocessor on each epoch.
if isinstance(dataset, Dataset):
    dataset = dataset.repeat()
# TODO: Replace with preprocessor.transform when possible.
Contributor:

we can use preprocessor._transform_pipeline here now
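A sketch of what that change might look like (following the reviewer's comment above; _transform_pipeline is the internal Preprocessor hook named there, the dataset and preprocessor are illustrative, and this is not the merged diff):

import random

import ray
from ray.data.preprocessors import BatchMapper

ds = ray.data.from_items([{"value": float(i)} for i in range(8)])
per_epoch_preprocessor = BatchMapper(lambda df: df + random.random(), batch_format="pandas")

pipe = ds.repeat()
# Instead of map_batches(per_epoch_preprocessor.transform_batch), apply the
# preprocessor's pipeline transform directly to the repeated pipeline.
pipe = per_epoch_preprocessor._transform_pipeline(pipe)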

@stephanie-wang merged commit ae167f0 into ray-project:master on Jan 31, 2023.
@stephanie-wang deleted the per-epoch-preprocessor branch on Jan 31, 2023.
@bveeramani mentioned this pull request on Feb 2, 2023.
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
This adds an option to the AIR DatasetConfig for a preprocessor that gets reapplied on each epoch. Currently the implementation uses DatasetPipeline to ensure that the extra preprocessing step is overlapped with training.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>