
SplitBatchRunner #1713

Open. emailweixu wants to merge 2 commits into base: pytorch

Conversation

emailweixu (Contributor)

Sometimes we want to use a large batch for training, but there is not enough memory. In this situation, we can split the large batch into smaller batches and use lean_function to run each small batch. This reduces memory usage because the intermediate tensors are all released.

SplitBatchRunner is implemented to make this process very simple.
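A minimal usage sketch (the Linear model, batch sizes, and shapes here are illustrative, not from the PR; the constructor arguments follow the Args documented below):

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 10)
    # Wrap the model so each forward pass runs in sub-batches of at
    # most 256 rows; the per-sub-batch outputs are concatenated back
    # along dim 0.
    runner = SplitBatchRunner(model, max_batch_size=256)

    big_batch = torch.randn(1024, 128)
    out = runner(big_batch)  # same shape as model(big_batch): [1024, 10]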

for i in range(0, batch_size, self._max_batch_size):
    # Slice every leaf tensor of (args, kwargs) to the current sub-batch.
    batch_args, batch_kwargs = alf.nest.map_structure(
        lambda x: x[i:i + self._max_batch_size], (args, kwargs))
    outputs.append(  # call truncated in this diff excerpt; each sub-batch's output is collected here


Are we supposed to get the same results from the full batch and the split batch? What if there are dropout and batch norm layers in the model?

emailweixu (Author)

As the doc-string notes, dropout is not supported for training. BatchNorm will also be affected, but if the batch size is not too small, its effect on training should be small.
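To see why BatchNorm is affected: in training mode it normalizes with the statistics of whatever batch it sees, so per-sub-batch statistics differ from full-batch statistics. A small self-contained demonstration (not from the PR):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    bn = nn.BatchNorm1d(4).train()
    x = torch.randn(8, 4)

    full = bn(x)                                  # normalized with stats of all 8 rows
    split = torch.cat([bn(x[:4]), bn(x[4:])], 0)  # each half uses its own stats
    print(torch.allclose(full, split))            # False in general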

emailweixu (Author)

Added notes.


Args:
    model (nn.Module): the model to run.
    max_batch_size (int): the maximum batch size for running the model.
Collaborator

Need to add a comment on what happens if max_batch_size <= 0.

emailweixu (Author)

Added.


# Concatenate the per-sub-batch outputs back into full-batch outputs.
return alf.nest.map_structure(lambda *x: torch.cat(x, dim=0), *outputs)

def original_forward(self, *args, **kwargs):
Collaborator

Where is this function used?

emailweixu (Author)

This is meant to be called by users of this class.
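For reference, the concatenation idiom in the excerpt above applies a function across corresponding leaves of several nests. A small sketch of what it does (the (logits, value) outputs here are made up for illustration):

    import torch
    import alf

    # Suppose each sub-batch forward returned a (logits, value) tuple.
    outputs = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(3)]

    # map_structure(f, *nests) applies f to corresponding leaves, so
    # each leaf is concatenated along the batch dimension.
    logits, value = alf.nest.map_structure(
        lambda *x: torch.cat(x, dim=0), *outputs)
    print(logits.shape, value.shape)  # [12, 10] and [12, 1]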



class SplitBatchRunner(torch.nn.Module):
    """Split the input into smaller batches and run the model on each batch.
Collaborator

Need to clarify that lean_function will be used.

emailweixu (Author)

Comment added.
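For context on the memory saving: the description says lean_function releases intermediate tensors, which is the same trade-off standard gradient checkpointing makes (free activations after forward, recompute them in backward). A sketch of that mechanism with PyTorch's built-in API, as an analogy rather than ALF's actual implementation:

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
    x = torch.randn(64, 128, requires_grad=True)

    # Intermediate activations inside `model` are freed after the forward
    # pass and recomputed during backward, trading compute for memory.
    y = checkpoint(model, x, use_reentrant=False)
    y.sum().backward()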
