[DataLoader2] Adding support for naive checkpointing #1119

NivekT · 2023-04-05T14:47:27Z

Stack from ghstack:

-> [DataLoader2] Adding support for naive checkpointing #1119

Differential Revision: D44712802

NivekT · 2023-04-05T16:05:03Z

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

NivekT · 2023-04-10T18:50:10Z

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan · 2023-04-18T16:17:52Z

torchdata/dataloader2/dataloader2.py

+            num_of_previously_yielded_batches = state_dict[NUM_PREV_YIELDED_BATCH_KEY_NAME]
+            self._num_prev_yielded_batches = num_of_previously_yielded_batches


Can we combine them into a single line?
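A minimal sketch of what the combined form could look like. The class `LoaderSketch` and the string value of the key are placeholders made up for illustration; only the constant's name and the attribute name come from the diff above.

```python
from typing import Any, Dict

# Hypothetical value: the diff shows only the constant's name, not what it holds.
NUM_PREV_YIELDED_BATCH_KEY_NAME = "num_prev_yielded_batches"


class LoaderSketch:
    """Stand-in for the loader, reduced to the one attribute under discussion."""

    def __init__(self) -> None:
        self._num_prev_yielded_batches = 0

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Look up the saved counter and store it in a single statement,
        # dropping the intermediate local variable.
        self._num_prev_yielded_batches = state_dict[NUM_PREV_YIELDED_BATCH_KEY_NAME]


loader = LoaderSketch()
loader.load_state_dict({NUM_PREV_YIELDED_BATCH_KEY_NAME: 3})
```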

ejguan · 2023-04-18T16:19:41Z

torchdata/dataloader2/dataloader2.py

@@ -365,6 +381,28 @@ def _restore_checkpoint_beginning_of_epoch(self) -> None:
        """
        self._seed_generator = self._initial_seed_generator

+    def _restore_naive_checkpoint(self, num_prev_yielded_batches: Optional[int] = None) -> DataLoader2Iterator[T_co]:


When do we need to specify num_prev_yielded_batches?

I have a noob question: why don't we choose the restore option automatically, naive or advanced, based on whether NUM_PREV_YIELDED_BATCH_KEY_NAME is in the state_dict?
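The automatic choice asked about here could be sketched roughly as follows. The function `choose_restore_mode`, the mode labels, and the key's string value are all hypothetical; the thread does not say what the non-naive path would do, so it is represented only by a label.

```python
from typing import Any, Dict

# Hypothetical value: only the constant's name appears in the PR.
NUM_PREV_YIELDED_BATCH_KEY_NAME = "num_prev_yielded_batches"


def choose_restore_mode(state_dict: Dict[str, Any]) -> str:
    """Pick a restore mode from the contents of a saved state_dict.

    Returns "naive" when a batch counter was saved, otherwise "fallback".
    Both labels are illustrative and not part of any real API.
    """
    if NUM_PREV_YIELDED_BATCH_KEY_NAME in state_dict:
        return "naive"
    return "fallback"
```

The trade-off is that the caller no longer controls which restore path runs, so an explicit API may still be needed for callers who want a particular one.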

> When do we need to specify num_prev_yielded_batches?

I see it as an option to override or to restore even if self._num_prev_yielded_batches has not been set. We can definitely take it out if you don't find it useful (and add it later if we see a need).

> automatically choose

I can imagine there are situations where users only want to restore the randomness state:

- the model is only saved once per epoch (so you want DataLoader2 to be in sync)
- maybe "naive" restoration is too slow and users just want to restore the randomness state, then do something custom

In these cases, a separate API to only restore randomness state would be good. Let me know if that is not what you are asking.
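To illustrate what restoring only the randomness state means, here is a small sketch that uses Python's standard `random.Random` as a stand-in for the loader's seed generator. Nothing here is torchdata API; `shuffled_order` is a hypothetical helper.

```python
import random
from typing import List


def shuffled_order(generator: random.Random, n: int) -> List[int]:
    """Return the indices 0..n-1 shuffled with the given generator."""
    order = list(range(n))
    generator.shuffle(order)
    return order


rng = random.Random(0)

# Capture the generator state at the start of the epoch.
saved_state = rng.getstate()
first_epoch_order = shuffled_order(rng, 8)

# Restoring only the randomness state reproduces the same shuffle,
# without replaying any batches.
rng.setstate(saved_state)
replayed_order = shuffled_order(rng, 8)
```

After such a restore the data order matches the original epoch, and skipping to the right position is left to the caller.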

> I see it as an option to override or to restore even if self._num_prev_yielded_batches has not been set.

If needed, users can always skip a number of iterations after calling _restore_checkpoint_beginning_of_epoch, right? I would like to keep the API minimal at first, until there is a solid use case.

> the model is only saved once per epoch (so you want DataLoader2 to be in sync)

In this case, we only need to save the dataloader state once per epoch as well, right? If we want to support fault tolerance, we need to make sure the model is stored at the same time.

> maybe "naive" restoration is too slow and users just want to restore the randomness state, then do something custom

Can we list all the expected scenarios and corresponding API calls in the summary of the PR?
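Skipping a number of iterations after a beginning-of-epoch restore, as suggested in this comment, might look like the sketch below. `fast_forward` is a hypothetical helper, and a plain list of batches stands in for the loader.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def fast_forward(batches: Iterable[T], num_prev_yielded_batches: int) -> Iterator[T]:
    """Discard the batches already consumed before the checkpoint.

    The skipped batches are still produced by the underlying iterable and
    then thrown away, which is why this style of restore can be slow.
    """
    return islice(iter(batches), num_prev_yielded_batches, None)


# Stand-in for one epoch of data after the start-of-epoch state was restored.
epoch = [[0, 1], [2, 3], [4, 5], [6, 7]]

# Two batches had been yielded when the checkpoint was taken.
resumed = list(fast_forward(epoch, 2))
```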

ejguan · 2023-04-18T17:01:23Z

torchdata/dataloader2/dataloader2.py

@@ -222,7 +227,8 @@ def __iter__(self) -> DataLoader2Iterator[T_co]:
            self._reset_iter = False

        self.valid_iterator_id = 0 if self.valid_iterator_id is None else self.valid_iterator_id + 1
-        return DataLoader2Iterator(self, self.valid_iterator_id)
+        self._iterator = DataLoader2Iterator(self, self.valid_iterator_id)


Can we try our best to avoid a circular reference here?
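One common way to avoid such a cycle is for the loader to keep only a weak reference to its iterator. The sketch below assumes that approach; `LoaderSketch`, `IteratorSketch`, and `current_iterator` are hypothetical names, not the classes in the diff.

```python
import gc
import weakref
from typing import List, Optional


class IteratorSketch:
    def __init__(self, loader: "LoaderSketch") -> None:
        self.loader = loader  # strong reference: iterator -> loader
        self._inner = iter(loader.data)

    def __iter__(self) -> "IteratorSketch":
        return self

    def __next__(self) -> int:
        return next(self._inner)


class LoaderSketch:
    def __init__(self, data: List[int]) -> None:
        self.data = data
        self._iterator_ref = None  # will hold a weak reference: loader -> iterator

    def __iter__(self) -> IteratorSketch:
        iterator = IteratorSketch(self)
        # A weak reference does not keep the iterator alive, so no cycle forms.
        self._iterator_ref = weakref.ref(iterator)
        return iterator

    def current_iterator(self) -> Optional[IteratorSketch]:
        """Return the live iterator, or None if it has been garbage collected."""
        return self._iterator_ref() if self._iterator_ref is not None else None


loader = LoaderSketch([1, 2, 3])
it = iter(loader)
was_tracked = loader.current_iterator() is it

del it
gc.collect()  # make collection explicit on interpreters without refcounting
```

With this layout the loader can still reach a live iterator, and dropping the last outside reference to the iterator lets it be freed.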

facebook-github-bot · 2023-06-09T07:06:19Z

Hi @NivekT!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

NivekT · 2023-06-11T04:22:49Z

Closing. Feel free to re-open if someone else would like to work on this.

[DataLoader2] Adding support for naive checkpointing

2766741

[ghstack-poisoned]

NivekT mentioned this pull request Apr 5, 2023

Temporarily disable CI MacOS doctest and aistore test #1118

Closed

NivekT added a commit that referenced this pull request Apr 5, 2023

[DataLoader2] Adding support for naive checkpointing

41e2d53

ghstack-source-id: 64cc77b Pull Request resolved: #1119

facebook-github-bot added the CLA Signed label Apr 5, 2023

Update on "[DataLoader2] Adding support for naive checkpointing"

1a119d2

[ghstack-poisoned]

NivekT requested a review from ejguan April 5, 2023 14:55

NivekT added a commit that referenced this pull request Apr 5, 2023

[DataLoader2] Adding support for naive checkpointing

1107201

ghstack-source-id: f9091dd Pull Request resolved: #1119

Update on "[DataLoader2] Adding support for naive checkpointing"

6b00696

Differential Revision: [D44712802](https://our.internmc.facebook.com/intern/diff/D44712802) [ghstack-poisoned]

NivekT added a commit that referenced this pull request Apr 5, 2023

[DataLoader2] Adding support for naive checkpointing

0002574

ghstack-source-id: 9b23881 Pull Request resolved: #1119

Update on "[DataLoader2] Adding support for naive checkpointing"

78aab55

Differential Revision: [D44712802](https://our.internmc.facebook.com/intern/diff/D44712802) [ghstack-poisoned]

NivekT mentioned this pull request Apr 6, 2023

[DataLoader2] Saving and restoring initial seed generator #1124

Closed

NivekT added a commit that referenced this pull request Apr 6, 2023

[DataLoader2] Adding support for naive checkpointing

960b28c

ghstack-source-id: 2b1dcb1 Pull Request resolved: #1119

Update on "[DataLoader2] Adding support for naive checkpointing"

ba7d77d

Differential Revision: [D44712802](https://our.internmc.facebook.com/intern/diff/D44712802) [ghstack-poisoned]

NivekT added a commit that referenced this pull request Apr 10, 2023

[DataLoader2] Adding support for naive checkpointing

4cfb78a

ghstack-source-id: 42e9901 Pull Request resolved: #1119

ejguan reviewed Apr 18, 2023

NivekT closed this Jun 11, 2023

facebook-github-bot deleted the gh/NivekT/114/head branch July 18, 2023 14:22
