
Fuse all hydras #814

Merged
merged 16 commits into main from rgao_fuse_hydras_0 on Aug 20, 2024

Conversation

rayg1234
Collaborator

@rayg1234 rayg1234 commented Aug 15, 2024

In order to make FineTuneHydra work in FM, we decided to consolidate Main:hydra, FMHydra, and FinetuneHydra, so no more hydras of hydras, yay!

Main Changes

  • Fuse finetune-hydra with main-hydra
  • Modify the output dict format for main-hydra so that it can be used with the FM trainer
  • Modify the logic that assigns outputs to targets so it accepts nested dicts as outputs (the consistent hydra format); this allows tasks to work with the FM trainer (see the sketch after this list)
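
For illustration only, a minimal sketch of how a trainer could resolve a target from either a flat or a nested output dict; the helper name resolve_output and its arguments are assumptions for this sketch, not the actual fairchem API:

import torch

def resolve_output(outputs: dict, target_property: str) -> torch.Tensor:
    # Flat case: outputs maps property names directly to tensors.
    if target_property in outputs and isinstance(outputs[target_property], torch.Tensor):
        return outputs[target_property]
    # Nested case (consistent hydra format): each head returns its own dict,
    # so search the per-head sub-dicts for the requested property.
    for head_outputs in outputs.values():
        if isinstance(head_outputs, dict) and target_property in head_outputs:
            return head_outputs[target_property]
    raise KeyError(f"Property {target_property!r} not found in model outputs")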

Config Changes

The updated hydra works with both main and FM with the following changes

  • main:hydra models need to have a "property" field in each output so that the trainer knows which output property of the model to assign to the target (previously this was implicit and would break if multiple heads had the same name); see the sketch after this list
  • Got rid of all the modes in FineTuneHydra and simplified it to only load from a checkpoint; it now has the following logic:
    • if backbone is specified, build the backbone, else attempt to get it from the starting_checkpoint
    • if heads is specified, build the heads, else attempt to get them from the starting_checkpoint
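
For illustration, a minimal sketch of what an outputs section with the new "property" field might look like; the output names and exact structure are assumptions based on the description above, not the full schema:

outputs:
  energy:
    property: energy
  forces:
    property: forces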

Config for finetuning with data-only

model:
  name: hydra
  finetune_config:
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"

Config for finetuning with backbone-only

model:
  name: hydra
  finetune_config:
    starting_checkpoint: "./checkpoints/2024-08-07-20-20-16-test/checkpoint.pt"
  heads:
    energy:
      module: equiformer_v2_energy_head
    forces:
      module: equiformer_v2_force_head

Follow-up: we need to make small updates to FM so that all the configs have heads in them (in all configs, the model will fully instantiate the backbone and heads, not tasks). This will allow us to completely remove FMHydra and FinetuneHydra.

For now, the logic to rewrite the finetune config has been removed; we will decide on it after getting this to work with FM first.

Testing

  • All previous e2e and finetune tests
  • Retrain an OC20 model and finetune it with OC22

@rayg1234 rayg1234 changed the title make hydra compat with multitask Fuse all hydras Aug 16, 2024
@rayg1234 rayg1234 marked this pull request as ready for review August 16, 2024 01:06
@rayg1234 rayg1234 added the enhancement (New feature or request) and minor (Minor version release) labels Aug 16, 2024

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 93.22034% with 4 lines in your changes missing coverage. Please review.

Files                                Patch %   Lines
src/fairchem/core/models/base.py     92.50%    3 Missing ⚠️
src/fairchem/core/common/utils.py    92.30%    1 Missing ⚠️

Files                                          Coverage Δ
src/fairchem/core/trainers/base_trainer.py    89.32% <ø> (-0.08%) ⬇️
src/fairchem/core/trainers/ocp_trainer.py     69.49% <100.00%> (+0.52%) ⬆️
src/fairchem/core/common/utils.py             67.80% <92.30%> (+1.69%) ⬆️
src/fairchem/core/models/base.py              88.48% <92.50%> (+1.60%) ⬆️

... and 7 files with indirect coverage changes

misko
misko previously approved these changes Aug 16, 2024
Collaborator

@misko misko left a comment


Looks great! LGTM! :hydra:

if not os.path.isfile(checkpoint_path):
    raise FileNotFoundError(
        errno.ENOENT, "Checkpoint file not found", checkpoint_path
    )
logging.info(f"Loading checkpoint from: {checkpoint_path}")
checkpoint = torch.load(checkpoint_path)
Collaborator

Do we want to use map_location here, in case we are loading on CPU vs. GPU?

Collaborator Author

good point
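
For illustration, a minimal sketch of the kind of change being discussed: loading the checkpoint onto CPU and letting the trainer move things to the target device later. The function name load_checkpoint_cpu is a placeholder for this sketch, not the code that landed in the PR.

import torch

def load_checkpoint_cpu(checkpoint_path: str) -> dict:
    # Load onto CPU regardless of which device the checkpoint was saved from;
    # the trainer can move tensors/modules to the target device afterwards.
    return torch.load(checkpoint_path, map_location=torch.device("cpu"))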

checkpoint = torch.load(checkpoint_path)
config = checkpoint["config"]["model"]
name = config.pop("name")
model = registry.get_model_class(name)(**config)
Collaborator

Should model.to() be called to move the model to the target device?

Collaborator Author

I think we shouldn't try to move the model to a device everywhere; that should only happen once, in the trainer.

wood-b
wood-b previously approved these changes Aug 17, 2024
Collaborator

@wood-b wood-b left a comment


This is awesome! A couple of minor comments:

  1. See the pass_through comment below
  2. It would be great to update ocp_hydra_example.yml in this PR, which I think just means adding the property key in outputs. Also, if you change ocp_hydra_example.yml, this line https://github.com/FAIR-Chem/fairchem/blob/main/configs/ocp_hydra_example.yml#L1 should be forces or ocp

otf_graph: bool = True,
pass_through_head_outputs: bool = False,
Collaborator

Should pass_through_head_outputs be head-specific instead of on/off for all heads? Instead of a single bool there would be one per head. Maybe there is an edge case where multiple heads are specified in the config, one of which returns multiple values and another that does not.

Collaborator

Maybe for now, if pass_through_head_outputs: True, we check that there is only 1 head specified? Or just ignore this edge case for the moment.

Collaborator Author

To keep things simple, I'm trying to constrain the output to either dict(str, tensor) or dict(str, dict(str, tensor)) (so either all tensors or all dicts of tensors); otherwise it's too hard to manage.
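
For illustration, a minimal sketch of the two output shapes described in the reply above; the head and property names, and the tensor shapes, are made up for the example:

import torch

# All tensors: dict(str, Tensor)
flat_outputs: dict[str, torch.Tensor] = {
    "energy": torch.zeros(1),
    "forces": torch.zeros(10, 3),
}

# All dicts of tensors: dict(str, dict(str, Tensor)), one sub-dict per head
nested_outputs: dict[str, dict[str, torch.Tensor]] = {
    "energy_head": {"energy": torch.zeros(1)},
    "force_head": {"forces": torch.zeros(10, 3)},
}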

@rayg1234 rayg1234 dismissed stale reviews from wood-b and misko via 9298191 August 19, 2024 18:20

def forward(self, data: Batch):
# get device from input; at least one input must be a tensor to figure out its device
Collaborator

This looks good to me.

nit: Is it possible to just do device = data.pos.device? That might be a bit safer, since it won't have to iterate over all values on each batch.

My only worry here is performance, but I don't think it should be too bad if the length of data.values() is reasonable ... and the runtime of a single batch is >500 ms?

Collaborator Author

I think this line should be negligible; I'd rather not assume anything about what's in the data batch, but I can also just save the device and only do this once.
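
For illustration, a hedged sketch of the caching idea described in the reply above; the class and attribute names are assumptions, and the batch is simplified to a plain dict of tensors rather than a PyG Batch:

from __future__ import annotations

import torch

class DeviceCachingSketch(torch.nn.Module):
    # Hypothetical illustration: detect the input device from the first batch
    # and reuse it for later batches instead of rescanning every time.
    def __init__(self) -> None:
        super().__init__()
        self._device: torch.device | None = None

    def forward(self, data: dict[str, torch.Tensor]) -> torch.device:
        if self._device is None:
            # At least one input must be a tensor to figure out its device.
            self._device = next(
                v.device for v in data.values() if isinstance(v, torch.Tensor)
            )
        return self._device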

@misko misko self-requested a review August 19, 2024 22:12
Collaborator

@misko misko left a comment


LGTM 🎉

@rayg1234 rayg1234 added this pull request to the merge queue Aug 19, 2024
Merged via the queue into main with commit 1f0f631 Aug 20, 2024
8 checks passed
@rayg1234 rayg1234 deleted the rgao_fuse_hydras_0 branch August 20, 2024 00:17
Labels
enhancement (New feature or request), minor (Minor version release)
Projects
None yet
3 participants