
Graph heads ci tests #208

Merged
merged 8 commits into ORNL:main from graph_heads_ci_tests on Apr 1, 2024

Conversation

allaffa
Collaborator

@allaffa allaffa commented Jan 18, 2024

The capability to use a full stack of convolutional layers when only nodal predictions are needed was never tested in the code.
This PR:

  • updates pre-existing code to maintain support for type="conv" as a choice for nodal heads of HydraGNN
  • adds if-then-else conditions to make sure that get_conv is called with the proper number of arguments. All models except SCFStack take two input arguments; SCFStack also takes last_layer, a Boolean variable.
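The dispatch described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual HydraGNN code: the class bodies and return values here are hypothetical stand-ins, and only the names get_conv, SCFStack, and last_layer come from the PR description.

```python
# Hypothetical sketch: call get_conv with the proper number of arguments
# depending on the model type, as described in the PR.

class GenericStack:
    # Most models: get_conv takes two input arguments.
    def get_conv(self, input_dim, output_dim):
        return ("conv", input_dim, output_dim)

class SCFStack(GenericStack):
    # SCFStack additionally takes last_layer, a Boolean variable.
    def get_conv(self, input_dim, output_dim, last_layer=False):
        return ("conv", input_dim, output_dim, last_layer)

def build_conv(stack, input_dim, output_dim, is_last):
    # If-then-else condition to pass the extra argument only when needed.
    if isinstance(stack, SCFStack):
        return stack.get_conv(input_dim, output_dim, last_layer=is_last)
    return stack.get_conv(input_dim, output_dim)
```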

@allaffa allaffa added the bug Something isn't working label Jan 18, 2024
@allaffa allaffa self-assigned this Jan 18, 2024
@allaffa
Collaborator Author

allaffa commented Jan 18, 2024

@pzhanggit
The fixes provided by this PR allow us to have a working baseline graph autoencoder in HydraGNN.

@allaffa allaffa force-pushed the graph_heads_ci_tests branch from 4947f79 to 310089b on January 19, 2024 03:15
@allaffa
Collaborator Author

allaffa commented Jan 19, 2024

I changed line 225 of the file hydragnn/utils/distributed.py as follows:

model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)

Setting find_unused_parameters=True does not make the code crash, and should allow us to track the parameters.

I also added the following lines to the train() function inside train_validate_test.py

        with record_function("forward"):
            data = data.to(get_device())
            pred = model(data)
            # Print unused parameters
            unused_params = [name for name, param in model.module.named_parameters() if not param.requires_grad]
            if unused_params:
                print("Unused Parameters:")
                for name in unused_params:
                    print(name)
            else:
                print("No unused parameters.")
            loss, tasks_loss = model.module.loss(pred, data.y, head_index)

However, no unused parameters are tracked.

@pzhanggit
Collaborator

> I changed line 225 of the file hydragnn/utils/distributed.py as follows:
>
> model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
>
> Setting find_unused_parameters=True does not make the code crash, and should allow us to track the parameters.
>
> I also added the following lines to the train() function inside train_validate_test.py
>
>         with record_function("forward"):
>             data = data.to(get_device())
>             pred = model(data)
>             # Print unused parameters
>             unused_params = [name for name, param in model.module.named_parameters() if not param.requires_grad]
>             if unused_params:
>                 print("Unused Parameters:")
>                 for name in unused_params:
>                     print(name)
>             else:
>                 print("No unused parameters.")
>             loss, tasks_loss = model.module.loss(pred, data.y, head_index)
>
> However, no unused parameters are tracked.

https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5
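The linked thread suggests checking param.grad after backward() rather than param.requires_grad: a parameter that never took part in the forward pass typically still has requires_grad=True, which is why the snippet above finds nothing. A minimal self-contained sketch of that distinction (the "used"/"unused" module names are illustrative, not from HydraGNN):

```python
import torch

# Two submodules, only one of which participates in the forward pass.
model = torch.nn.ModuleDict({
    "used": torch.nn.Linear(3, 1),
    "unused": torch.nn.Linear(3, 1),  # never called in the forward pass
})

x = torch.randn(2, 3)
loss = model["used"](x).sum()
loss.backward()

# requires_grad is True for all parameters, so this check finds nothing.
by_requires_grad = [n for n, p in model.named_parameters() if not p.requires_grad]

# After backward(), parameters that did not participate have .grad == None.
by_missing_grad = [n for n, p in model.named_parameters() if p.grad is None]
```

Here by_requires_grad comes back empty while by_missing_grad reports the parameters of the uncalled submodule, matching the behavior described in the forum thread.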

@allaffa
Collaborator Author

allaffa commented Jan 20, 2024

> I changed line 225 of the file hydragnn/utils/distributed.py as follows:
> model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
> Setting find_unused_parameters=True does not make the code crash, and should allow us to track the parameters.
> I also added the following lines to the train() function inside train_validate_test.py
>
>         with record_function("forward"):
>             data = data.to(get_device())
>             pred = model(data)
>             # Print unused parameters
>             unused_params = [name for name, param in model.module.named_parameters() if not param.requires_grad]
>             if unused_params:
>                 print("Unused Parameters:")
>                 for name in unused_params:
>                     print(name)
>             else:
>                 print("No unused parameters.")
>             loss, tasks_loss = model.module.loss(pred, data.y, head_index)
>
> However, no unused parameters are tracked.
>
> https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5

@pzhanggit thanks for the link.

If you do not use DDP, then you need to call model.parameters() as in the link you provided. If we use DDP, then replacing model with model.module (as we already do in other parts of the code) and calling model.module.parameters() should do the same.

Am I missing something?
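The equivalence claimed above can be checked directly: DistributedDataParallel stores the wrapped network as .module, so iterating model.module.parameters() visits exactly the same parameters as iterating the unwrapped model. A minimal single-process sketch (the gloo backend and the toy Linear model are assumptions for illustration, not HydraGNN's actual setup):

```python
import os
import torch
import torch.distributed as dist

# Single-process process group so DDP can be constructed on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = torch.nn.parallel.DistributedDataParallel(model)

# DDP keeps the original network as .module, so parameter iteration over
# ddp_model.module matches iteration over the unwrapped model.
same_object = ddp_model.module is model
names = [n for n, _ in ddp_model.module.named_parameters()]

dist.destroy_process_group()
```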

hydragnn/models/Base.py: review comment (outdated, resolved)
@allaffa allaffa force-pushed the graph_heads_ci_tests branch from e6ce4aa to e580a88 on January 23, 2024 00:41
@allaffa
Collaborator Author

allaffa commented Jan 23, 2024

@JustinBakerMath

This PR fixes some bugs introduced by successive code developments and re-establishes the capability to create a graph autoencoder in HydraGNN that relies solely on message passing layers for node-to-node mapping predictions.
This is not meant to replace the shared MLP; it just aims at providing the user with an additional (optional) architecture for nodal predictions. This way, the user can either choose to build (1) a stack of message passing layers with a shared MLP on top, or (2) use message passing layers all the way to the end.

The performance of the graph autoencoder using only message passing layers is pretty disappointing on the unit tests. However, this is in line with previous runs observed by Pei a long time ago on the FePt dataset.
Since we want to further develop generative graph diffusion models on top of HydraGNN, I would like to see how graph autoencoders behave in that context.

Would you mind helping me make sure that my changes do not mess up the SchNet layer?
I remember that for some equivariant models you turn off the batch norm. I would like your help to make sure that I did not add bugs when the graph autoencoder uses an equivariant message passing backend.

Thanks,

@JustinBakerMath
Collaborator

> @JustinBakerMath
>
> This PR fixes some bugs introduced by successive code developments and re-establishes the capability to create a graph autoencoder in HydraGNN that relies solely on message passing layers for node-to-node mapping predictions. This is not meant to replace the shared MLP; it just aims at providing the user with an additional (optional) architecture for nodal predictions. This way, the user can either choose to build (1) a stack of message passing layers with a shared MLP on top, or (2) use message passing layers all the way to the end.
>
> The performance of the graph autoencoder using only message passing layers is pretty disappointing on the unit tests. However, this is in line with previous runs observed by Pei a long time ago on the FePt dataset. Since we want to further develop generative graph diffusion models on top of HydraGNN, I would like to see how graph autoencoders behave in that context.
>
> Would you mind helping me make sure that my changes do not mess up the SchNet layer? I remember that for some equivariant models you turn off the batch norm. I would like your help to make sure that I did not add bugs when the graph autoencoder uses an equivariant message passing backend.
>
> Thanks,

Thank you for your patience.

These updates have not introduced any errors into the implementations of EGCL or SchNet.

This PR has added batch normalization to the convolutional "head". The performance of the batch normalization can be assessed by overriding the _init_node_conv() function in the children of Base, in the same fashion as the existing overrides of the _init_conv() function.

This isn't necessary for this PR. However, going forward, I would be happy to assist in assessing the batch normalization performance.
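The override pattern described above can be sketched as follows. This is a hypothetical stand-in, not the actual HydraGNN code: the class bodies and the EquivariantStack name are illustrative; only Base, _init_conv(), and _init_node_conv() come from the discussion.

```python
# Hypothetical sketch: a child of Base overrides _init_node_conv() to
# change the conv head (e.g., drop batch norm for equivariant models),
# mirroring how _init_conv() is already overridden in the children.

class Base:
    def __init__(self):
        self._init_conv()
        self._init_node_conv()

    def _init_conv(self):
        # Default message passing stack (stand-in for real layers).
        self.conv_layers = ["conv", "batch_norm"]

    def _init_node_conv(self):
        # Default convolutional head: conv followed by batch normalization.
        self.node_conv_layers = ["conv", "batch_norm"]

class EquivariantStack(Base):
    def _init_node_conv(self):
        # Equivariant models may skip batch norm in the conv head.
        self.node_conv_layers = ["conv"]
```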

@allaffa allaffa requested a review from pzhanggit January 26, 2024 19:49
@pzhanggit
Collaborator

The tests failed because the changes require torch-geometric>=2.4.0, which has dropped support for Python 3.7, as in pyg-team/pytorch_geometric#7939. We will come back to this PR later.

@allaffa
Collaborator Author

allaffa commented Mar 4, 2024

@pzhanggit
Do you think that the version upgrades from PR #210 (now merged into the master) would allow the tests of this PR to pass after rebasing?

@pzhanggit
Collaborator

> @pzhanggit Do you think that the version upgrades from PR #210 (now merged into the master) would allow the tests of this PR to pass after rebasing?

Yes, I will rebase now.

@pzhanggit pzhanggit force-pushed the graph_heads_ci_tests branch from a215810 to d2a551d on March 4, 2024 16:53
@jychoi-hpc
Member

PyG released a new version (https://github.com/pyg-team/pytorch_geometric/releases/tag/2.5.2), which includes the fix for the problem. Can we check whether pyg 2.5.2 works?

torch_geometric==2.5.2

@pzhanggit pzhanggit force-pushed the graph_heads_ci_tests branch from d2a551d to afe91b1 on April 1, 2024 15:53
@pzhanggit pzhanggit force-pushed the graph_heads_ci_tests branch from 6177412 to d8cd181 on April 1, 2024 18:00
@allaffa allaffa merged commit c520b80 into ORNL:main Apr 1, 2024
2 checks passed
@allaffa allaffa deleted the graph_heads_ci_tests branch April 1, 2024 18:33
RylieWeaver pushed a commit to RylieWeaver/HydraGNN that referenced this pull request Oct 17, 2024
* upgrade pyg 2.5.2

* JSON file for convolutional heads added and test_graph updated

* thresholds increased for EGNN and SchNet

* Update test_graphs.py

Threshold increased for SchNet in unit test for convolutional heads

* update DimeNet weight initialization by Justin

* hyperparameter adjust for conv_head tests

* format

* relax error tolerance in conv_head for GIN

---------

Co-authored-by: Choi <choij@ornl.gov>
Co-authored-by: Zhang, Pei <zhangp1@ornl.gov>