Per Batch Padded Dataset #281
Conversation
Needs a one-line change but is otherwise ready for review.
@@ -16,8 +16,11 @@ data:
#train_data_path: <PATH>
#val_data_path: <PATH>
#test_data_path: <PATH>
dataset:
  name: cached
I think I'd prefer `type` over `name`. But I see why you don't want to use that as an attribute. Maybe `method` or `processing`?
I used `name` to be consistent with the other uses of discriminated unions in our config.
`method` would be fine, but I think we should only use one keyword for this purpose.
I'll change it to `processing`.
My reasoning against `name` was that `dataset:name` reminds me of `train`, `test`, or `SPICE`, etc.
ah yeah I can see how this might be confusing. thanks, I changed it 👍
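For illustration, a discriminated union keyed on a single field could look roughly like this in a pydantic-style config; the class names, field names, and discriminator values below are hypothetical, not apax's actual schema:

from typing import Literal, Union

from pydantic import BaseModel, Field

class CachedDataset(BaseModel):
    # The discriminator value selects this variant during validation.
    processing: Literal["cached"] = "cached"

class PerBatchPaddedDataset(BaseModel):
    processing: Literal["pbp"] = "pbp"
    num_workers: int = 10

class DataConfig(BaseModel):
    # pydantic dispatches on the `processing` field instead of trying
    # each union member in turn.
    dataset: Union[CachedDataset, PerBatchPaddedDataset] = Field(
        discriminator="processing"
    )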
"""Dataset which pads everything (atoms, neighbors) | ||
to the next larges power of two. | ||
This limits the compute wasted due to padding at the (negligible) | ||
cost of some recompilations. | ||
The NL is computed on-the-fly in parallel for `num_workers` of batches. | ||
Does not use tf.data. |
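As a rough sketch of the padding rule described in that docstring (not the actual apax implementation), padding a batch of position arrays to the next power of two might look like:

import numpy as np

def next_power_of_two(n: int) -> int:
    # Smallest power of two >= n, e.g. 60 -> 64.
    return 1 if n <= 0 else 1 << (n - 1).bit_length()

def pad_batch_positions(positions_list):
    # Pad every (n_atoms, 3) array in the batch to the same power-of-two
    # atom count, so jit only recompiles when the padded size changes.
    target = next_power_of_two(max(p.shape[0] for p in positions_list))
    return np.stack(
        [np.pad(p, ((0, target - p.shape[0]), (0, 0))) for p in positions_list]
    )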
I assume `PB` stands for `parallel batches`. Maybe mention that once somewhere in the docstring, so that it is clear. I would also not write `MP` but `materials project`.
The class name stands for PerBatchPadded. I guess I can just write it out. Same for materials project.
actually, I don't see "PB" written anywhere
apax/data/input_pipeline.py
Outdated
n_epochs,
n_jit_steps=1,
buffer_size=20,
num_workers=10,
Does it make sense to set `num_workers` to None and get the default from the number of available cores?
Probably. I'll update the default.
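A minimal sketch of such a default (the helper name is hypothetical, not the actual apax code):

import os
from typing import Optional

def resolve_num_workers(num_workers: Optional[int]) -> int:
    # Fall back to the number of available CPU cores when unspecified.
    if num_workers is None:
        return os.cpu_count() or 1
    return num_workers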
apax/data/input_pipeline.py
Outdated
if n_jit_steps > 1:
    raise "PerBatchPaddedDataset is not yet compatible with multi step jit"
I'm not sure, but in general it's better to raise a concrete exception, like `raise TypeError(msg...)` here, isn't it?
Yes, not doing so was not intended. Thanks for pointing it out.
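Note that in Python 3, `raise` with a bare string is itself a `TypeError` (exceptions must derive from `BaseException`), so the original message would never reach the user. The fixed check might look like the following; `NotImplementedError` is one reasonable choice for an unimplemented combination, but the actual type is up to the authors:

if n_jit_steps > 1:
    raise NotImplementedError(
        "PerBatchPaddedDataset is not yet compatible with multi step jit"
    )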
def transpose_dict_of_lists(dict_of_lists: dict):
    list_of_dicts = []
    keys = list(dict_of_lists.keys())

    for i in range(len(dict_of_lists[keys[0]])):
        data = {k: dict_of_lists[k][i] for k in keys}
        list_of_dicts.append(data)

    return list_of_dicts
missing unit test
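Such a test could be as simple as the following pytest-style sketch:

from apax.data.input_pipeline import transpose_dict_of_lists

def test_transpose_dict_of_lists():
    dict_of_lists = {"a": [1, 2, 3], "b": [4, 5, 6]}
    expected = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]
    assert transpose_dict_of_lists(dict_of_lists) == expected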
pre-commit.ci autofix (for more information, see https://pre-commit.ci)
I have not tested it locally, but it looks good 👍
Adds a dataset which does not use tf.data and pads samples per batch instead of padding everything to the largest size.
This is very advantageous for training datasets containing samples of very different sizes.
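To illustrate the advantage with made-up numbers (hypothetical sizes, and samples grouped by size for clarity; not a benchmark):

import numpy as np

# Hypothetical atom counts for eight samples of very different sizes.
n_atoms = np.array([7, 8, 9, 10, 58, 60, 62, 64])

# Padding everything to the largest sample wastes a lot of compute:
global_waste = int(np.sum(n_atoms.max() - n_atoms))  # 234 padded atoms

# Per-batch padding (batch size 4) only pads to each batch's own maximum:
batches = [n_atoms[:4], n_atoms[4:]]
per_batch_waste = sum(int(np.sum(b.max() - b)) for b in batches)  # 6 + 12 = 18

print(global_waste, per_batch_waste)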