
Hyperopt steps per epoch not being computed correctly #2175

Merged

Conversation

arnavgarg1
Contributor

@arnavgarg1 arnavgarg1 commented Jun 21, 2022

Following up on this issue, when using windowing in the backend over a partitioned dataset (> 1 partition), we observed that steps_per_epoch was being under-calculated. This resulted in each epoch comprising only a fraction of the dataset.

This PR modifies the RayDatasetShard class to use the size of the RayDataset, rather than the size of the DatasetPipeline, when calculating steps per epoch, bringing the step count back to what we'd expect.
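For illustration, a minimal sketch of the idea behind the fix (hypothetical function names, not the actual diff; the real change lives inside RayDatasetShard):

    # Hypothetical sketch: the old length came from a single window of the
    # DatasetPipeline, which under-counts rows whenever the pipeline has
    # more than one window.
    def shard_len_old(pipeline_window) -> int:
        return pipeline_window.count()  # rows in one window only

    # The fix derives the length from the full RayDataset instead, so
    # steps_per_epoch reflects every row in the shard.
    def shard_len_new(ray_dataset) -> int:
        return ray_dataset.count()  # rows in the whole dataset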

Validation that this change works:

  • Dataset size: ~100MB (~354,000 rows in the training data)
  • Partitioning the dataset into 10 blocks
  • Setting the loader's window size to 2048 bytes

Before:

  • Calling train_cli() with the dataset resulted in:
    • 10 windows being created
    • Just 2 steps per epoch; with a batch size of ~32K, that meant we would only train over ~64K rows per epoch

After:

  • Calling train_cli() with the dataset results in:
    • 10 windows being created
    • 11 steps per epoch; with a batch size of ~32K, that means ~354K rows per epoch, which is correct
2022-06-21 19:36:36,768 INFO dataset.py:2643 -- Created DatasetPipeline with 10 windows: 1.75MiB min, 1.78MiB max, 1.77MiB mean
2022-06-21 19:36:36,802 INFO dataset.py:2643 -- Created DatasetPipeline with 10 windows: 0.52MiB min, 0.54MiB max, 0.53MiB mean
2022-06-21 19:36:36,834 INFO dataset.py:2643 -- Created DatasetPipeline with 10 windows: 0.27MiB min, 0.29MiB max, 0.28MiB mean
(BaseWorkerMixin pid=13772) Note: steps_per_checkpoint (was 2000) is now set to the number of steps per epoch: 11.
(BaseWorkerMixin pid=13772) 
(BaseWorkerMixin pid=13772) Training for 1100 step(s), approximately 100 epoch(s).
(BaseWorkerMixin pid=13772) Early stopping policy: 10 round(s) of evaluation, or 110 step(s), approximately 10 epoch(s).
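As a quick sanity check on those numbers (assuming "~32K" means a batch size of 32,768):

    import math

    rows = 354_000        # approximate training rows in the 100MB dataset
    batch_size = 32_768   # assumed value for the "~32K" batch size
    print(math.ceil(rows / batch_size))  # -> 11 steps per epoch, matching the log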

@arnavgarg1 arnavgarg1 linked an issue Jun 21, 2022 that may be closed by this pull request
@arnavgarg1 arnavgarg1 self-assigned this Jun 21, 2022
@github-actions

github-actions bot commented Jun 21, 2022

Unit Test Results

6 files ±0    6 suites ±0    2h 20m 35s ⏱️ -11m 39s
2 920 tests +1    2 874 ✔️ +1    46 💤 ±0    0 ❌ ±0
8 760 runs +3    8 618 ✔️ +3    142 💤 ±0    0 ❌ ±0

Results for commit 9f0b3dc. ± Comparison against base commit 884c319.


@arnavgarg1 arnavgarg1 marked this pull request as ready for review June 21, 2022 20:22
@arnavgarg1 arnavgarg1 marked this pull request as draft June 21, 2022 20:41
@arnavgarg1 arnavgarg1 marked this pull request as ready for review June 21, 2022 22:41
@arnavgarg1 arnavgarg1 changed the title from "2144 hyperopt steps per epoch not being computed correctly" to "Hyperopt steps per epoch not being computed correctly" Jun 21, 2022
@@ -259,8 +267,11 @@ def tune_learning_rate_fn(
    initialize_pytorch(horovod=hvd)

    pipe = dataset.pipeline(shuffle=False, **data_loader_kwargs)

    # Expensive blocking call
Contributor

Is this because of the new len(dataset), or was this always an expensive blocking call?

Collaborator

len(dataset) does a full iteration over the dataset (as it does not know the size a priori).

Contributor Author

@justinxzhao It was always an expensive blocking call. I think calling the count() method on a Ray dataset or dataset pipeline is blocking and forces a full iteration over the dataset, as Travis mentioned.
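For context, one way to avoid paying that cost more than once per shard is to cache the result, roughly along these lines (illustrative names, not the PR's exact code):

    from functools import lru_cache

    class RayDatasetShard:
        def __init__(self, dataset_shard):
            self.dataset_shard = dataset_shard

        @lru_cache(1)
        def __len__(self):
            # Expensive blocking call: count() has to walk the data because
            # the size is not known up front.
            return self.dataset_shard.count()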

@@ -160,6 +163,7 @@ def train_fn(

    train_shard = RayDatasetShard(
        rt.get_dataset_shard("train"),
        train_dataset_size,
Collaborator

This is the dataset size before being split among the workers. Have you tested this with multiple training workers?

I believe we also need to divide this by the number of workers, and account for rounding (see the Ray Datasets implementation of split()).
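Roughly, the adjustment being described looks like this (illustrative only; the exact rounding should mirror how Ray Datasets' split() distributes rows):

    import math

    def per_worker_steps(dataset_size, num_workers, batch_size):
        # Each worker trains on roughly 1/num_workers of the rows; split() may
        # not divide them perfectly evenly, so this is an approximation.
        shard_size = math.ceil(dataset_size / num_workers)
        return max(1, math.ceil(shard_size / batch_size))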

Contributor Author

@tgaddair This would just be setting

backend:
    trainer:
        num_workers: 2

or

backend:
    trainer:
        num_workers: 4

right? If so, then this still works as intended with this fix.

I think the reason multiple workers aren't an issue is that we call the ray.data.Dataset.split() method on the DatasetPipeline objects, splitting the pipeline among multiple workers rather than the dataset itself. So as long as we use the size of the dataset directly, we should be okay.

Would love it if someone could double-check this understanding. @ShreyaR, are you able to pull this branch and see if these changes fix the problem you were seeing before?

Collaborator

I suppose what I don't understand is that the number of steps per epoch is dependent on the number of workers. So if you have 100 batches and 4 workers, there should be 25 steps per epoch. So how do we account for this in this approach?

In the previous implementation here, you can see that we're taking the length of the dataset pipeline after it has been split, which accounts for the number of workers.
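Concretely, with the numbers from that example:

    total_batches = 100   # batches in the full dataset
    num_workers = 4
    steps_per_epoch = total_batches // num_workers  # each worker sees 25 batches
    assert steps_per_epoch == 25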

Contributor Author

@tgaddair Gotcha, that makes sense. I'm able to reproduce what you said and can see that this implementation doesn't account for the number of workers and rounding. Will look into this again and create a fix that is in line with what you described.

Collaborator

I wonder if it would be possible to window after splitting to work around this? Seems like it should be the same effect, more or less.
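A rough sketch of that suggestion, assuming the Ray Datasets API of the time (Dataset.split(), Dataset.window(), DatasetPipeline.repeat()); the names here are illustrative:

    def split_then_window(dataset, num_workers, window_size_bytes):
        # Split the Dataset across workers first, then window each shard, so
        # every worker's pipeline covers its entire shard rather than a
        # window of the global dataset.
        shards = dataset.split(num_workers, equal=True)
        return [
            shard.window(bytes_per_window=window_size_bytes).repeat()
            for shard in shards
        ]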

@arnavgarg1 arnavgarg1 marked this pull request as draft June 23, 2022 16:47
@arnavgarg1
Contributor Author

arnavgarg1 commented Jun 26, 2022

I spent some more time investigating this issue today. I've been using the 100MB dataset.

A few things I noticed:

  • Changing window_size_bytes to any value between 128 and 8192 didn't change the number of windows, i.e., the number of windows created = the number of dataset partitions (which felt odd, but I guess it is possible).
  • Neither window_size_bytes nor partitioning the dataset causes issues with steps_per_epoch on its own; the problem only appears when both are used in conjunction.

To account for these observations, my latest push tracks whether windowing is being used and the number of partitions/blocks in the dataset within the RayDataset class. This is then factored in when calculating the size of each RayDatasetShard. This seems to fix steps_per_epoch not being calculated correctly during training, but it introduces a new bug: the message printed during evaluation looks incorrect because the number of epochs is artificially incremented more than it should be. For example, if I use 2 workers (6 steps per epoch), the first round of evaluation runs after 6 steps, and the evaluation message says "Running evaluation for 6 steps, 5 epochs", but it should actually say 1 epoch.

Will continue to look into why this is happening.

@ShreyaR
Contributor

ShreyaR commented Jun 28, 2022

@arnavgarg1 not sure why the evaluation epochs would get messed up. A few things to try out:

  • Is the evaluation epoch count correct when you only use one worker?
  • Can you check whether the other places where the epoch number is printed (e.g., during training) are correct? I'd also double-check wherever progress_tracker.epochs is accessed/updated, add breakpoints, and see what the current epoch number is.

@justinxzhao
Contributor

This update to progress_tracker.epochs is not protected by self.is_coordinator. If that's the culprit, then perhaps the epoch overcounting issue is not unique to your change.
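For reference, the kind of guard being described would look roughly like this (sketch only; the surrounding trainer code is not shown here):

    def maybe_increment_epoch(progress_tracker, is_coordinator):
        # Only the coordinator should bump the shared epoch counter; if every
        # worker increments it, the epoch count is overstated.
        if is_coordinator:
            progress_tracker.epochs += 1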

Contributor

@ShreyaR ShreyaR left a comment

I'm not sure if tracking the total number of partitions and calculating the size of the dataset by multiplying by the total number of partitions is the way to go.

I like @tgaddair's suggestion of doing pipelining on each shard instead of on the entire dataset. Here are some code changes you can make to achieve that:

You can update RayTrainerV2's train function to pass in RayDataset instead of DatasetPipeline objects.

        dataset = {"train": training_set}
        if validation_set is not None:
            dataset["val"] = validation_set
        if test_set is not None:
            dataset["test"] = test_set

You will also need to update RayDatasetShard so that it expects a RayDataset instead of a DatasetPipeline.

class RayDatasetShard(Dataset):
    def __init__(...):
        self.dataset_iter = dataset_shard.pipeline(...).iter_datasets()

    @lru_cache(1)
    def __len__(self):
        # TODO(travis): find way to avoid calling this, as it's expensive
        return self.dataset_shard.count()

You may need to make some other changes and do some general bookkeeping, but that's the broad idea. Let me know if this makes sense.

    return next(self.dataset_iter).count()
    next_iteration_length = next(self.dataset_iter).count()
    if self.num_dataset_partitions > 1 and self.window_status:
        return next_iteration_length * self.num_dataset_partitions
Contributor

My concern with this solution is that it might not be accurate: two partitions may not have the same number of examples.
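A small made-up example of how that can go wrong:

    # Hypothetical partition sizes for a 100,000-row dataset in 3 blocks
    partition_sizes = [40_000, 35_000, 25_000]
    estimate = partition_sizes[0] * len(partition_sizes)  # 120,000
    actual = sum(partition_sizes)                         # 100,000
    # Multiplying one window's count by the number of partitions misestimates
    # the size whenever the partitions are uneven.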

@arnavgarg1
Contributor Author

@ShreyaR This makes a lot of sense and is also very clean, thank you for your help! I will update this today and see if it fixes the issue we were seeing.

@arnavgarg1 arnavgarg1 marked this pull request as ready for review July 8, 2022 21:39
@arnavgarg1
Contributor Author

This PR has now been updated to correctly read batches from iter_epochs. I've tested it with multiple workers and multiple window sizes, and it works as intended. Should be good to go - thanks for all your help @ShreyaR @tgaddair @justinxzhao

@arnavgarg1
Contributor Author

The biggest change is to pull batches from the epoch iterator, which creates a pipeline over the windows. Prior to this, we were pulling batches from the windowed dataset directly, which caused fewer steps per epoch since each window is only a subset of the overall dataset.
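A hedged sketch of that flow, using the Ray DatasetPipeline API of the time (window(), repeat(), iter_epochs(), iter_batches()); the exact wiring in the PR may differ, and train_step is a placeholder:

    def train_over_epochs(dataset_shard, window_size_bytes, batch_size, train_step):
        # Window the shard into a pipeline, then pull batches one epoch at a
        # time. Each epoch pipeline spans every window, so the step count
        # reflects the whole shard instead of a single window.
        pipeline = dataset_shard.window(bytes_per_window=window_size_bytes).repeat()
        for epoch_pipeline in pipeline.iter_epochs():
            for batch in epoch_pipeline.iter_batches(batch_size=batch_size):
                train_step(batch)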

@arnavgarg1 arnavgarg1 requested review from justinxzhao, ShreyaR and tgaddair and removed request for ShreyaR July 8, 2022 21:57
@justinxzhao justinxzhao merged commit 7c929d3 into master Jul 11, 2022
@justinxzhao justinxzhao deleted the 2144-hyperopt-steps-per-epoch-not-being-computed-correctly branch July 11, 2022 17:38
justinxzhao pushed a commit that referenced this pull request Jul 11, 2022
* getting count of dataset instead of window

* read length from dataset instead of pipeline

* Removing older commented code

* Refactor

* Removing dead code

* Modify RayDatasetShard length to factor in windowing and dataset partitions

* Moving to iter_epochs()

* Working fix
Successfully merging this pull request may close these issues.

[Hyperopt] Steps per epoch not being computed correctly