
Balanced batch sampler+base dataset #753

Merged: 67 commits, Aug 2, 2024

Conversation


@misko misko commented Jul 9, 2024

Add BaseDataset and corresponding changes to the balanced batch sampler.
Added BaseDataset.metadata_hasattr() to check whether the metadata associated with a dataset has a specific attribute. In the case of BalancedBatchSampler we require 'natoms' to be present.
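The attribute check described above can be sketched roughly as follows; the class and field names here are illustrative assumptions, not the actual fairchem implementation:

```python
from dataclasses import dataclass, field


@dataclass
class _Metadata:
    # Hypothetical stand-in for a dataset's metadata mapping.
    fields: dict = field(default_factory=dict)


class BaseDatasetSketch:
    def __init__(self, metadata: dict):
        self._metadata = _Metadata(fields=metadata)

    def metadata_hasattr(self, attr: str) -> bool:
        # True if the dataset's metadata carries the requested attribute.
        return attr in self._metadata.fields


def check_sampler_requirements(dataset) -> None:
    # A balanced batch sampler needs per-sample atom counts to balance on.
    if not dataset.metadata_hasattr("natoms"):
        raise ValueError("BalancedBatchSampler requires dataset metadata with 'natoms'.")
```

A sampler would call `check_sampler_requirements` once at construction time and fail fast rather than mid-epoch.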

@misko misko requested a review from mshuaibii July 23, 2024 21:12
assert (
len(self.paths) == 1
), f"{type(self)} does not support a list of src paths."
self.path = self.paths[0]
Collaborator:

Why is a list being fed in here in the first place?

misko (Author):

This is a good point. I don't know; I assumed it was because some datasets might take multiple files, potentially multiple LMDB paths. But I don't think this is currently used, so I will take this out.

misko (Author):

On second pass, it looks like we do use multiple paths for ASE datasets:
https://github.com/FAIR-Chem/fairchem/blob/main/tests/core/datasets/test_ase_datasets.py#L89
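Since ASE datasets can take multiple src paths while other datasets cannot, one common pattern is for the base class to normalize string-or-list input and let subclasses that only support one source assert on the length, mirroring the snippet quoted above. A minimal sketch (function names are assumptions):

```python
from pathlib import Path


def normalize_paths(src) -> list:
    # Accept a single path or a list of paths; always return a list of Paths.
    if isinstance(src, (str, Path)):
        return [Path(src)]
    return [Path(p) for p in src]


def require_single_path(paths: list) -> Path:
    # Datasets that cannot handle multiple sources check the length up front,
    # as in the assert quoted in the diff above.
    assert len(paths) == 1, "this dataset does not support a list of src paths"
    return paths[0]
```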

@@ -214,7 +212,7 @@ def get_metadata(self, num_samples: int = 100):
}


- class SinglePointLmdbDataset(LmdbDataset[BaseData]):
+ class SinglePointLmdbDataset(LmdbDataset):
Collaborator:

Given the dataset changes, this may be a good time to deprecate these.

Warning: you'll break a few tests. If you break way too many, you can revert and we can do this as part of BE.
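One conventional way to deprecate a class like SinglePointLmdbDataset is to emit a DeprecationWarning in its constructor while keeping it functional; this is a hedged sketch of that pattern, not necessarily how the PR ultimately handled it:

```python
import warnings


class LmdbDataset:
    # Simplified stand-in for the real LmdbDataset.
    def __init__(self, config=None):
        self.config = config or {}


class SinglePointLmdbDataset(LmdbDataset):
    def __init__(self, config=None):
        # Warn callers once per call site; stacklevel=2 points the warning
        # at the user's code rather than at this __init__.
        warnings.warn(
            "SinglePointLmdbDataset is deprecated; use LmdbDataset instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(config)
```

Tests that construct the deprecated class directly will then start failing under `-W error`, which is the "you'll break a few tests" effect noted above.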

@@ -114,9 +116,6 @@ def __init__(self, config, transform=None) -> None:
if self.train_on_oc20_total_energies:
with open(config["oc20_ref"], "rb") as fp:
self.oc20_ref = pickle.load(fp)
if self.config.get("lin_ref", False):
coeff = np.load(self.config["lin_ref"], allow_pickle=True)["coeff"]
self.lin_ref = torch.nn.Parameter(torch.tensor(coeff), requires_grad=False)
self.subsample = aii(self.config.get("subsample", False), bool)
Collaborator:

This functionality is technically now supported by your base dataset, right? If so, we can remove it.

@@ -455,7 +456,9 @@ def load_model(self) -> None:
self.logger.log_summary({"num_params": self.model.num_params})

if distutils.initialized() and not self.config["noddp"]:
Collaborator:

How does this PR behave when --noddp is defined? It would be good to test that functionality.

misko (Author):

Added!
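The condition quoted in the diff above only wraps the model when distributed training is initialized and --noddp was not passed. A toy sketch of that branch, with string tags standing in for the real DistributedDataParallel wrapper:

```python
def maybe_wrap_ddp(model, distributed_initialized: bool, noddp: bool):
    # Mirror of `if distutils.initialized() and not self.config["noddp"]`:
    # wrap in DDP only when running distributed and DDP was not disabled.
    if distributed_initialized and not noddp:
        return ("DDP", model)  # stand-in for DistributedDataParallel(model)
    return ("plain", model)
```

A --noddp test then just needs to assert that the model is left unwrapped when the flag is set, even under an initialized distributed backend.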

@misko misko force-pushed the balanced-batch-sampler+base-dataset branch from 235965f to 4426870 Compare July 29, 2024 17:52
@misko misko force-pushed the balanced-batch-sampler+base-dataset branch from 65b5a6f to 1e2e4aa Compare July 29, 2024 18:20
msg = msg[:-1] + f"that are smaller than the given max_atoms {max_atoms}."
raise ValueError(msg)

indices = indices[:max_index]
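The max_atoms filtering that the snippet above belongs to can be sketched as follows; this is a simplified stand-in assuming per-sample atom counts are available, not the PR's exact implementation:

```python
def indices_under_max_atoms(natoms, max_atoms):
    # Sort sample indices by atom count, then keep only samples that fit
    # within the budget; raise if nothing qualifies, mirroring the
    # ValueError path in the snippet quoted above.
    order = sorted(range(len(natoms)), key=lambda i: natoms[i])
    kept = [i for i in order if natoms[i] <= max_atoms]
    if not kept:
        raise ValueError(f"No samples with natoms <= {max_atoms}.")
    return kept
```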
Collaborator:

I'm testing this on my end. If I have a config as follows:

dataset:
  train:
    format: lmdb
    src: data/s2ef/all/train/
    first_n: 2000000
    key_mapping:
      y: energy
      force: forces
    transforms:
      normalizer:
        energy:
          mean: -0.7554450631141663
          stdev: 2.887317180633545
        forces:
          mean: 0
          stdev: 2.887317180633545
  val:
    src: data/s2ef/all/val_id_30k

Because we have val set up to inherit the train parameters, it will also try to grab first_n for validation/testing. To get around this, I would need to explicitly define first_n for the validation/test splits, which is not ideal.

We should have it such that something like first_n does not get passed down to val/test. I don't know the best way to tackle this; open to ideas.
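One possible fix for the inheritance problem described above is to strip train-only keys before deriving the val/test configs from the train config; the exclusion set below is an assumption for illustration, not the list the project settled on:

```python
import copy

# Hypothetical keys that should never be inherited by val/test splits.
TRAIN_ONLY_KEYS = {"first_n", "metadata_path"}


def derive_split_config(train_cfg: dict, split_cfg: dict) -> dict:
    # Start from the train config, drop keys that must not be inherited,
    # then let the split's own settings take precedence.
    base = {k: copy.deepcopy(v) for k, v in train_cfg.items() if k not in TRAIN_ONLY_KEYS}
    base.update(split_cfg)
    return base
```

With this, the val split from the config above would inherit format and transforms but not first_n, unless it sets first_n explicitly itself.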

Collaborator:

The same is true for metadata_path, etc. All of it is being inherited.

misko (Author):

Totally fair; again, great catch @mshuaibii! Thank you! Looking into it...

@misko misko added this pull request to the merge queue Aug 2, 2024
Merged via the queue into main with commit 04a69b0 Aug 2, 2024
7 checks passed
@misko misko deleted the balanced-batch-sampler+base-dataset branch August 2, 2024 18:18
lbluque pushed a commit that referenced this pull request Aug 7, 2024
* Update BalancedBatchSampler to use datasets' `data_sizes` method
Replace BalancedBatchSampler's `force_balancing` and `throw_on_error` parameters with `on_error`

* Remove python 3.10 syntax

* Documentation

* Added set_epoch method

* Format

* Changed "resolved dataset" message to be a debug log to reduce log spam

* clean up batchsampler and tests

* base dataset class

* move lin_ref to base dataset

* inherit basedataset for ase dataset

* filter indices prop

* added create_dataset fn

* yaml load fix

* create dataset function instead of filtering in base

* remove filtered_indices

* make create_dataset and LMDBDatabase importable from datasets

* create_dataset cleanup

* test create_dataset

* use metadata.natoms directly and add it to subset

* use self.indices to handle shard

* rename _data_sizes

* fix Subset of metadata

* minor change to metadata, added full path option

* import updates

* implement get_metadata for datasets; add tests for max_atoms and balanced partitioning

* a[:len(a)+1] does not throw error, change to check for this

* off by one fix

* fixing tests

* plug create_dataset into trainer

* remove datasetwithsizes; fix base dataset integration; replace close_db with __del__

* lint

* add/fix test;

* adding new notebook for using fairchem models with NEBs without CatTSunami enumeration (#764)

* adding new notebook for using fairchem models with NEBs

* adding md tutorials

* blocking code cells that arent needed or take too long

* Add extra test case for local batch size = 1

* fix example

* fix test case

* reorg changes

* remove metadata_has_sizes in favor of basedataset function metadata_hasattr

* fix data_parallel typo

* fix up some tests

* rename get_metadata to sample_property_metadata

* add slow get_metadata for ase; add tests for get_metadata (ase+lmdb); add test for make lmdb metadata sizes

* add support for different backends and ddp in pytest

* fix tests and balanced batch sampler

* make default dataset lmdb

* lint

* fix tests

* test with world_size=0 by default

* fix tests

* fix tests..

* remove subsample from oc22 dataset

* remove old datasets; add test for noddp

* remove load balancing from docs

* fix docs; add train_split_settings and test for this

---------

Co-authored-by: Nima Shoghi <nimashoghi@gmail.com>
Co-authored-by: Nima Shoghi <nimashoghi@fb.com>
Co-authored-by: lbluque <lbluque@meta.com>
Co-authored-by: Brandon <bmwood@meta.com>
Co-authored-by: Brook Wander <73855115+brookwander@users.noreply.github.com>
Co-authored-by: Muhammed Shuaibi <45150244+mshuaibii@users.noreply.github.com>
Co-authored-by: Muhammed Shuaibi <mushuaibi@gmail.com>

(cherry picked from commit 04a69b0)
@rayg1234 rayg1234 added minor Minor version release enhancement New feature or request labels Aug 13, 2024
@wood-b wood-b mentioned this pull request Oct 24, 2024
7 participants