
using tf.data for fit method in DeepEnsemble model #890

Closed
wants to merge 2 commits

Conversation

@hstojic (Collaborator) commented Jan 17, 2025

this simple change should improve memory handling, should be better optimized for GPUs, and generally gives the user more control over preparing data for training
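
For context, a minimal sketch of the idea, building a tf.data pipeline and handing it to fit; prepare_tf_data matches the name in the diff below, but the exact signature here is an assumption:

import tensorflow as tf

def prepare_tf_data(x, y, batch_size, num_points):
    # Build a tf.data pipeline from in-memory tensors: tf.data handles
    # shuffling, batching and prefetching itself, keeping memory bounded
    # and overlapping host-side data preparation with device-side training.
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.shuffle(num_points, reshuffle_each_iteration=True)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

The resulting dataset is then passed directly to model.fit in place of raw arrays.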

@hstojic requested review from uri-granta and avullo on January 17, 2025 14:19
@uri-granta (Collaborator) left a comment

Various comments below. Happy to review again (including the tests) once the tests are passing.

batch_size: int,
num_points: int,
validation_split: float = 0.0,
) -> Union[tf.data.Dataset, tuple[tf.data.Dataset, tf.data.Dataset]]:

might be nicer to always return a tuple?

Suggested change
) -> Union[tf.data.Dataset, tuple[tf.data.Dataset, tf.data.Dataset]]:
) -> tuple[tf.data.Dataset, Optional[tf.data.Dataset]]]:

If validation_split > 0, returns a tuple of (training_dataset, validation_dataset)
"""
if not 0.0 <= validation_split < 1.0:
raise ValueError("validation_split must be between 0 and 1")

Suggested change
raise ValueError("validation_split must be between 0 and 1")
raise ValueError(f"validation_split must be between 0 and 1: got {validation_split}")


if validation_split > 0:
# Calculate split sizes
val_size = int(num_points * validation_split)

Suggested change
val_size = int(num_points * validation_split)
val_size = round(num_points * validation_split)
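
The difference matters for small datasets, since int() truncates toward zero while round() picks the nearest integer:

num_points, validation_split = 10, 0.15
int(num_points * validation_split)    # 1 (1.5 truncated down)
round(num_points * validation_split)  # 2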

tf_data = self.prepare_tf_data(
x,
y,
batch_size=fit_args_copy["batch_size"],

"batch_size" isn't guaranteed to exist for a user-supplied fit_args

Suggested change
batch_size=fit_args_copy["batch_size"],
batch_size=fit_args_copy.get("batch_size"),
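
For reference, indexing raises on a missing key while .get returns a default; the fallback of 32 below is simply Keras' own default batch size:

fit_args = {"epochs": 100}                   # user-supplied, no batch_size
# fit_args["batch_size"]                     # would raise KeyError
batch_size = fit_args.get("batch_size")      # None when the key is absent
batch_size = fit_args.get("batch_size", 32)  # or fall back explicitly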

@hstojic (Author) replied:

ah, well spotted, I was remembering BatchOptimizer...


x, y = self.prepare_dataset(dataset)

validation_split = fit_args_copy.pop("validation_split", 0.0)
tf_data = self.prepare_tf_data(

(if you change the return type above as suggested)

Suggested change
tf_data = self.prepare_tf_data(
train_dataset, val_dataset = self.prepare_tf_data(
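
With the Optional return type suggested earlier, the call site stays uniform whatever the split; a sketch of how the surrounding method body could then look (it reuses names from the diff and is not a definitive implementation):

train_dataset, val_dataset = self.prepare_tf_data(
    x, y, batch_size=batch_size, num_points=num_points,
    validation_split=validation_split,
)
if val_dataset is not None:
    fit_args_copy["validation_data"] = val_dataset
history = self.model.fit(
    train_dataset, **fit_args_copy, initial_epoch=self._absolute_epochs
)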


if validation_split > 0:
train_dataset, val_dataset = tf_data
fit_args_copy["validation_data"] = val_dataset

should we maybe raise an exception if "validation_data" is already present in the fit_args?
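
Such a guard could look like this (a sketch; the message wording is illustrative):

if "validation_data" in fit_args_copy:
    raise ValueError(
        "fit_args must not contain 'validation_data'; "
        "request a validation set via validation_split instead"
    )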

Comment on lines +476 to +480
history = self.model.fit(
train_dataset, **fit_args_copy, initial_epoch=self._absolute_epochs
)
else:
history = self.model.fit(tf_data, **fit_args_copy, initial_epoch=self._absolute_epochs)

Suggested change
history = self.model.fit(
train_dataset, **fit_args_copy, initial_epoch=self._absolute_epochs
)
else:
history = self.model.fit(tf_data, **fit_args_copy, initial_epoch=self._absolute_epochs)
history = self.model.fit(tf_data, **fit_args_copy, initial_epoch=self._absolute_epochs)

# Original behavior when no validation split is requested
return (
dataset.prefetch(tf.data.AUTOTUNE)
.shuffle(train_size, reshuffle_each_iteration=True)

(I think?)

Suggested change
.shuffle(train_size, reshuffle_each_iteration=True)
.shuffle(num_points, reshuffle_each_iteration=True)
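
For reference, the argument to shuffle is the buffer size: only that many elements are held for sampling at once, so a buffer at least as large as the data gives a uniform shuffle, and reshuffle_each_iteration=True redraws the order on every epoch. A small illustration:

import tensorflow as tf

ds = tf.data.Dataset.range(6)
# A buffer of 2 only mixes nearby elements; a buffer >= dataset size
# gives a uniform shuffle over the whole dataset.
print(list(ds.shuffle(2, seed=0).as_numpy_iterator()))
print(list(ds.shuffle(6, seed=0).as_numpy_iterator()))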


return train_dataset, val_dataset
else:
# Original behavior when no validation split is requested

Q: is this really the same as the original behaviour?

@uri-granta self-requested a review on January 20, 2025 10:10
@pio-neil

I do have a small concern about the use of tf.data.Dataset.shuffle. When I did some testing with this before, the shuffle buffer (which is tf.data.Dataset's internal mechanism for shuffling data) used around 18 GB of extra memory. This was with a dataset of 30 million rows, with 15 inputs and one output, and a batch size of 1000. The shuffle buffer also has an impact on speed, but I suspect this is relatively minor compared to the model training time.

This might not be such a problem with smaller datasets. So perhaps it would be a good idea to do some benchmarking?

However, it's also not clear to me why we're introducing shuffling by default here, when AFAICT it wasn't there before? This seems like a change of behaviour. Do we expect this to improve model accuracy? It may be better to let the user of Trieste control this, rather than making it the default behaviour?
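
For scale: a full-dataset shuffle buffer materialises every row, so 30M rows × 16 values is already ~3.8 GB at float64 before TensorFlow's own overhead, consistent with the observation above. A common mitigation, sketched here under the assumption that some loss of shuffle uniformity is acceptable, is a bounded buffer:

import tensorflow as tf

def bounded_pipeline(x, y, batch_size, buffer_size=100_000):
    # Hold only buffer_size rows in the shuffle buffer rather than the
    # whole dataset, trading shuffle uniformity for bounded memory.
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    ds = ds.shuffle(buffer_size, reshuffle_each_iteration=True)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)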

@hstojic (Author) commented Feb 5, 2025

> I do have a small concern about the use of tf.data.Dataset.shuffle. When I did some testing with this before, the shuffle buffer (which is tf.data.Dataset's internal mechanism for shuffling data) used around 18 GB of extra memory. This was with a dataset of 30 million rows, with 15 inputs and one output, and a batch size of 1000. The shuffle buffer also has an impact on speed, but I suspect this is relatively minor compared to the model training time.

> This might not be such a problem with smaller datasets. So perhaps it would be a good idea to do some benchmarking?

I have already done some testing, on much smaller data than that, and haven't seen any adverse effects.
Users who handle bigger datasets can now do the data preparation themselves by subclassing the model and overriding the new method, if there are any speed/memory issues.

> However, it's also not clear to me why we're introducing shuffling by default here, when AFAICT it wasn't there before? This seems like a change of behaviour. Do we expect this to improve model accuracy? It may be better to let the user of Trieste control this, rather than making it the default behaviour?

The shuffle argument defaults to True in Keras' fit method, hence the default here, but it's definitely needed: we typically run many epochs, and shuffling the data at each epoch helps model training, so yes, it should improve model accuracy. The user can always override the new method; that was the purpose of extracting this data preparation step into a method.

@pio-neil commented Feb 5, 2025

> However, it's also not clear to me why we're introducing shuffling by default here, when AFAICT it wasn't there before? This seems like a change of behaviour. Do we expect this to improve model accuracy? It may be better to let the user of Trieste control this, rather than making it the default behaviour?

> The shuffle argument defaults to True in Keras' fit method, hence the default here, but it's definitely needed: we typically run many epochs, and shuffling the data at each epoch helps model training, so yes, it should improve model accuracy. The user can always override the new method; that was the purpose of extracting this data preparation step into a method.

Right, yes, I see. I was just wondering why we were doing things differently in terms of shuffling, but we aren't (since fit ignores the shuffle argument for datasets):

> shuffle: Boolean, whether to shuffle the training data before each epoch. This argument is ignored when x is a keras.utils.PyDataset, tf.data.Dataset, torch.utils.data.DataLoader or Python generator function.

@hstojic changed the title from "using tf.data for fit method instead of" to "using tf.data for fit method in DeepEnsemble model" on Feb 5, 2025
@hstojic (Author) commented Feb 5, 2025

after some further thinking, this is currently an unnecessary complication of the code, as for the smallish datasets we mainly deal with here the current code is good enough

@hstojic closed this on Feb 5, 2025