
Hyperopt steps per epoch not being computed correctly #2175

Merged

Conversation

arnavgarg1
Contributor

@arnavgarg1 arnavgarg1 commented Jun 21, 2022

Following up on this issue, when using windowing in the backend over a partitioned dataset (> 1 partition), we observed that steps_per_epoch was being under-calculated. This resulted in each epoch comprising only a fraction of the dataset.

This PR modifies the RayDatasetShard class to use the size of the RayDataset, rather than the size of the DatasetPipeline, when calculating steps per epoch, bringing the step count back to what we'd expect.
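For illustration, a minimal sketch of the idea behind the fix (hypothetical function names, not the actual diff; the real change lives inside RayDatasetShard):

    # Hypothetical sketch: the old length came from a single window of the
    # DatasetPipeline, which under-counts rows whenever the pipeline has
    # more than one window.
    def shard_len_old(pipeline_window) -> int:
        return pipeline_window.count()  # rows in one window only

    # The fix derives the length from the full RayDataset instead, so
    # steps_per_epoch reflects every row in the shard.
    def shard_len_new(ray_dataset) -> int:
        return ray_dataset.count()  # rows in the whole dataset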

Validation that this change works:

  • Dataset size: ~100MB (~354,000 rows in the training data)
  • Partitioning the dataset into 10 blocks
  • Setting the loader's window size to 2048 bytes

Before:

  • Calling train_cli() with the dataset resulted in:
    • 10 windows being created
    • Just 2 steps per epoch; with a batch size of ~32K, that meant we would only train over ~64K rows per epoch

After:

  • Calling train_cli() with the dataset results in:
    • 10 windows being created
    • 11 steps per epoch; with a batch size of ~32K, that means ~354K rows per epoch, which is correct
2022-06-21 19:36:36,768 INFO dataset.py:2643 -- Created DatasetPipeline with 10 windows: 1.75MiB min, 1.78MiB max, 1.77MiB mean
2022-06-21 19:36:36,802 INFO dataset.py:2643 -- Created DatasetPipeline with 10 windows: 0.52MiB min, 0.54MiB max, 0.53MiB mean
2022-06-21 19:36:36,834 INFO dataset.py:2643 -- Created DatasetPipeline with 10 windows: 0.27MiB min, 0.29MiB max, 0.28MiB mean
(BaseWorkerMixin pid=13772) Note: steps_per_checkpoint (was 2000) is now set to the number of steps per epoch: 11.
(BaseWorkerMixin pid=13772) 
(BaseWorkerMixin pid=13772) Training for 1100 step(s), approximately 100 epoch(s).
(BaseWorkerMixin pid=13772) Early stopping policy: 10 round(s) of evaluation, or 110 step(s), approximately 10 epoch(s).
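As a quick sanity check on those numbers (assuming "~32K" means a batch size of 32,768):

    import math

    rows = 354_000        # approximate training rows in the 100MB dataset
    batch_size = 32_768   # assumed value for the "~32K" batch size
    print(math.ceil(rows / batch_size))  # -> 11 steps per epoch, matching the log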

@arnavgarg1 arnavgarg1 linked an issue Jun 21, 2022 that may be closed by this pull request
@arnavgarg1 arnavgarg1 self-assigned this Jun 21, 2022
@github-actions

github-actions bot commented Jun 21, 2022

Unit Test Results

6 files ±0    6 suites ±0    2h 20m 35s ⏱️ -11m 39s
2 920 tests +1    2 874 ✔️ +1    46 💤 ±0    0 ❌ ±0
8 760 runs +3    8 618 ✔️ +3    142 💤 ±0    0 ❌ ±0

Results for commit 9f0b3dc. ± Comparison against base commit 884c319.


@arnavgarg1 arnavgarg1 marked this pull request as ready for review June 21, 2022 20:22
@arnavgarg1 arnavgarg1 marked this pull request as draft June 21, 2022 20:41
@arnavgarg1 arnavgarg1 marked this pull request as ready for review June 21, 2022 22:41
@arnavgarg1 arnavgarg1 changed the title from "2144 hyperopt steps per epoch not being computed correctly" to "Hyperopt steps per epoch not being computed correctly" Jun 21, 2022
@@ -259,8 +267,11 @@ def tune_learning_rate_fn(
    initialize_pytorch(horovod=hvd)

    pipe = dataset.pipeline(shuffle=False, **data_loader_kwargs)

    # Expensive blocking call
Contributor

Is this because of the new len(dataset), or was this always an expensive blocking call?

Collaborator

len(dataset) does a full iteration over the dataset (as it does not know the size a priori).

Contributor Author

@justinxzhao It was always an expensive blocking call. I think calling the count() method on a Ray dataset or dataset pipeline is blocking and forces a full iteration over the dataset, as Travis mentioned.
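For context, one way to avoid paying that cost more than once per shard is to cache the result, roughly along these lines (illustrative names, not the PR's exact code):

    from functools import lru_cache

    class RayDatasetShard:
        def __init__(self, dataset_shard):
            self.dataset_shard = dataset_shard

        @lru_cache(1)
        def __len__(self):
            # Expensive blocking call: count() has to walk the data because
            # the size is not known up front.
            return self.dataset_shard.count()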

@@ -160,6 +163,7 @@ def train_fn(

    train_shard = RayDatasetShard(
        rt.get_dataset_shard("train"),
        train_dataset_size,
Collaborator

This is the dataset size before being split among the workers. Have you tested this with multiple training workers?

I believe we also need to divide this by the number of workers, and account for rounding (see the Ray Datasets implementation of split()).
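Roughly, the adjustment being described looks like this (illustrative only; the exact rounding should mirror how Ray Datasets' split() distributes rows):

    import math

    def per_worker_steps(dataset_size, num_workers, batch_size):
        # Each worker trains on roughly 1/num_workers of the rows; split() may
        # not divide them perfectly evenly, so this is an approximation.
        shard_size = math.ceil(dataset_size / num_workers)
        return max(1, math.ceil(shard_size / batch_size))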

Contributor Author

@tgaddair This would just be setting

backend:
    trainer:
        num_workers: 2

or

backend:
    trainer:
        num_workers: 4

right? If so, then this still works as intended with this fix.

I think the reason multiple workers aren't an issue is that we call the ray.data.Dataset.split() method on the DatasetPipeline objects, splitting the pipeline among multiple workers rather than the dataset itself. So as long as we use the size of the dataset directly, we should be okay.

Would love it if someone could double-check this understanding. @ShreyaR, are you able to pull this branch and see if these changes fix the problem you were seeing before?

Collaborator

I suppose what I don't understand is that the number of steps per epoch is dependent on the number of workers. So if you have 100 batches and 4 workers, there should be 25 steps per epoch. So how do we account for this in this approach?

In the previous implementation here, you can see that we're taking the length of the dataset pipeline after it has been split, which accounts for the number of workers.
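Concretely, with the numbers from that example:

    total_batches = 100   # batches in the full dataset
    num_workers = 4
    steps_per_epoch = total_batches // num_workers  # each worker sees 25 batches
    assert steps_per_epoch == 25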

Contributor Author

@tgaddair Gotcha, that makes sense. I'm able to reproduce what you said and can see that this implementation doesn't account for the number of workers and rounding. Will look into this again and create a fix that is in line with what you described.

Collaborator

I wonder if it would be possible to window after splitting to work around this? Seems like it should be the same effect, more or less.
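A rough sketch of that suggestion, assuming the Ray Datasets API of the time (Dataset.split(), Dataset.window(), DatasetPipeline.repeat()); the names here are illustrative:

    def split_then_window(dataset, num_workers, window_size_bytes):
        # Split the Dataset across workers first, then window each shard, so
        # every worker's pipeline covers its entire shard rather than a
        # window of the global dataset.
        shards = dataset.split(num_workers, equal=True)
        return [
            shard.window(bytes_per_window=window_size_bytes).repeat()
            for shard in shards
        ]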

@arnavgarg1 arnavgarg1 marked this pull request as draft June 23, 2022 16:47
@arnavgarg1
Contributor Author

arnavgarg1 commented Jun 26, 2022

I spent some more time investigating this issue today. I've been using the 100MB dataset.

A few things I noticed:

  • Changing window_size_bytes to any value between 128 and 8192 didn't change the number of windows, i.e., the number of windows created = the number of dataset partitions (which felt odd, but I guess it is possible).
  • Neither window_size_bytes nor partitioning the dataset causes issues with steps_per_epoch on its own; the problem only appears when both are used in conjunction.

To account for these observations, my latest push tracks whether windowing is being used and the number of partitions/blocks in the dataset within the RayDataset class. This is then factored in when calculating the size of each RayDatasetShard. This seems to fix steps_per_epoch not being calculated correctly during training, but it introduces a new bug: the message printed during evaluation looks incorrect because the number of epochs is artificially incremented more than it should be. For example, if I use 2 workers (6 steps per epoch), the first round of evaluation runs after 6 steps, and the evaluation message says "Running evaluation for 6 steps, 5 epochs", but it should actually say 1 epoch.

Will continue to look into why this is happening.

@ShreyaR
Contributor

ShreyaR commented Jun 28, 2022

@arnavgarg1 not sure why the evaluation epochs would get messed up. A few things to try out:

  • Is the evaluation epoch count correct when you only use one worker?
  • Can you check whether the other places where the epoch number is printed (e.g., during training) are correct? I'd also double-check wherever progress_tracker.epochs is accessed/updated, add breakpoints, and see what the current epoch number is.

@justinxzhao
Contributor

This update to progress_tracker.epochs is not protected by self.is_coordinator. If that's the culprit, then perhaps the epoch overcounting issue is not unique to your change.
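For reference, the kind of guard being described would look roughly like this (sketch only; the surrounding trainer code is not shown here):

    def maybe_increment_epoch(progress_tracker, is_coordinator):
        # Only the coordinator should bump the shared epoch counter; if every
        # worker increments it, the epoch count is overstated.
        if is_coordinator:
            progress_tracker.epochs += 1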

Contributor

@ShreyaR ShreyaR left a comment

I'm not sure if tracking the total number of partitions and calculating the size of the dataset by multiplying by the total number of partitions is the way to go.

I like @tgaddair's suggestion of doing pipelining on each shard instead of on the entire dataset. Here are some code changes you can make to achieve that:

You can update RayTrainerV2's train function to pass in RayDataset instead of DatasetPipeline objects.

        dataset = {"train": training_set}
        if validation_set is not None:
            dataset["val"] = validation_set
        if test_set is not None:
            dataset["test"] = test_set

You will also need to update RayDatasetShard so that it expects a RayDataset instead of a DatasetPipeline.

class RayDatasetShard(Dataset):
    def __init__(...):
        self.dataset_iter = dataset_shard.pipeline(...).iter_datasets()

    @lru_cache(1)
    def __len__(self):
        # TODO(travis): find way to avoid calling this, as it's expensive
        return self.dataset_shard.count()

You may need to make some other changes and do some general bookkeeping, but that's the broad idea. Let me know if this makes sense.

    return next(self.dataset_iter).count()
    next_iteration_length = next(self.dataset_iter).count()
    if self.num_dataset_partitions > 1 and self.window_status:
        return next_iteration_length * self.num_dataset_partitions
Contributor

My concern with this solution is that it might not be accurate: two partitions may not have the same number of examples.
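A small made-up example of how that can go wrong:

    # Hypothetical partition sizes for a 100,000-row dataset in 3 blocks
    partition_sizes = [40_000, 35_000, 25_000]
    estimate = partition_sizes[0] * len(partition_sizes)  # 120,000
    actual = sum(partition_sizes)                         # 100,000
    # Multiplying one window's count by the number of partitions misestimates
    # the size whenever the partitions are uneven.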

@arnavgarg1
Contributor Author

@ShreyaR This makes a lot of sense and is also very clean, thank you for your help! I will update this today and see if it fixes the issue we were seeing.

@arnavgarg1 arnavgarg1 marked this pull request as ready for review July 8, 2022 21:39
@arnavgarg1
Contributor Author

This PR has now been updated to correctly read batches from iter_epochs. I've tested it with multiple workers and multiple window sizes, and it works as intended. Should be good to go - thanks for all your help @ShreyaR @tgaddair @justinxzhao

@arnavgarg1
Contributor Author

The biggest change is to pull batches from the epoch iterator, which creates a pipeline over the windows. Prior to this, we were pulling batches from the windowed dataset directly, which caused fewer steps per epoch since each window is only a subset of the overall dataset.
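A hedged sketch of that flow, using the Ray DatasetPipeline API of the time (window(), repeat(), iter_epochs(), iter_batches()); the exact wiring in the PR may differ, and train_step is a placeholder:

    def train_over_epochs(dataset_shard, window_size_bytes, batch_size, train_step):
        # Window the shard into a pipeline, then pull batches one epoch at a
        # time. Each epoch pipeline spans every window, so the step count
        # reflects the whole shard instead of a single window.
        pipeline = dataset_shard.window(bytes_per_window=window_size_bytes).repeat()
        for epoch_pipeline in pipeline.iter_epochs():
            for batch in epoch_pipeline.iter_batches(batch_size=batch_size):
                train_step(batch)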

@arnavgarg1 arnavgarg1 requested review from justinxzhao, ShreyaR and tgaddair and removed request for ShreyaR July 8, 2022 21:57
@justinxzhao justinxzhao merged commit 7c929d3 into master Jul 11, 2022
@justinxzhao justinxzhao deleted the 2144-hyperopt-steps-per-epoch-not-being-computed-correctly branch July 11, 2022 17:38
justinxzhao pushed a commit that referenced this pull request Jul 11, 2022
* getting count of dataset instead of window

* read length from dataset instead of pipeline

* Removing older commented code

* Refactor

* Removing dead code

* Modify RayDatasetShard length to factor in windowing and dataset partitions

* Moving to iter_epochs()

* Working fix
Successfully merging this pull request may close these issues.

[Hyperopt] Steps per epoch not being computed correctly