Support multiple node type sampling in `NeighborLoader` #5013

Padarn · 2022-07-20T07:53:51Z

This PR adds support for sampling multiple node types in `NeighborLoader`.

The interface follows what was discussed in the roadmap (#4765):

NeighborLoader(
   input_nodes=[
     ('paper', torch.LongTensor([0,1,2])), 
     ('author', torch.LongTensor([0,1,2])) 
   ]
  ...
)

Internally, this is converted to a flat list of (node_type, index) tuples:

[('paper', 0), ('paper', 1),....]

This is not very efficient, but the benchmarks in #4765 (comment) showed it to be acceptable.
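For readers following along, the flattening step can be sketched roughly as below. This is a minimal illustration, not the PR's implementation: plain Python lists stand in for `torch.LongTensor`, and the helper name `flatten_input_nodes` and the local `HeteroNodeList` alias are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical alias for the flat, per-node representation.
HeteroNodeList = List[Tuple[str, int]]


def flatten_input_nodes(
        input_nodes: Sequence[Tuple[str, Sequence[int]]]) -> HeteroNodeList:
    """Expand (node_type, indices) pairs into one (node_type, index) per node."""
    return [(node_type, int(i)) for node_type, index in input_nodes
            for i in index]


# Plain lists are an assumption; the PR passes LongTensors here.
input_nodes = [("paper", [0, 1, 2]), ("author", [0, 1, 2])]
flat = flatten_input_nodes(input_nodes)
print(flat[:2])  # [('paper', 0), ('paper', 1)]
```

Each seed node becomes its own Python tuple, which is where the inefficiency mentioned above comes from.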

TODO:

Add tests
Add support for None instead of providing specific nodes for some node types

Addresses #4765

Padarn · 2022-08-09T13:03:35Z

torch_geometric/loader/neighbor_loader.py

-    def __call__(self, index: Union[List[int], Tensor]):
-        if not isinstance(index, torch.LongTensor):
-            index = torch.LongTensor(index)
+    def __call__(self, index: Union[List[int], Tensor, HeteroNodeList]):


@wsad1 Can I get your opinion on this (note the PR is overall still somewhat WIP).

I was finding the logic here getting quite complicated. Do you have any suggestions for simplification? One way would be to split the sampler into one class for hetero and one for non-hetero, and handle any conversions in the collate_fn.

Thanks for the update. I'll check it out once I am back from a short vacation on Tuesday morning.

No rush, thank you.

Padarn · 2022-08-15T00:51:09Z

test/loader/test_neighbor_loader.py

+        ], batch_size=batch_size, directed=directed, shuffle=False)
+
+    for batch1, batch2, batch3 in zip(loader, loader2, loader3):
+        assert torch.allclose(batch1['paper'].x, batch2['paper'].x)


This test is flaky: it passes locally but not in the CI. Any suggestions for what might be better?

wsad1

Thanks @Padarn for the update. I am still looking at your code to see if we can simplify things. Added some initial comments.

test/loader/test_neighbor_loader.py

Padarn · 2022-08-19T00:37:02Z

Thanks. FYI I'll also do this for the link loader and add examples in separate PRs, so we can iterate on the complexity later too if it seems okay but not perfect.

Padarn · 2022-09-03T12:32:37Z

Hey @mananshah99 do you also want to take a look at this one? May need to merge it with #5312

wsad1

Added some more comments; will take another look.

wsad1 · 2022-09-05T09:11:59Z

torch_geometric/loader/neighbor_loader.py

+def to_hetero_list(input_nodes: List[Tuple[str, Tensor]]) -> HeteroNodeList:
+    return [(node_type, i) for node_type, index in input_nodes for i in index]


There might be cases where we return nodes of only one type. Should we mention that this can't be used to train a model which predicts on multiple node types?

Hmm I didn't fully understand this - do you mean some samples from the neighbor sampler will only contain a single type?

Yeah it might contain samples from only one type.
I am not sure what the use case for this multiple node type sampling is. If an envisioned use case is predicting on multiple node types in a hetero graph, that might not work.
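One way single-type batches can arise is sketched below. It assumes seed nodes are batched sequentially from the flat list without shuffling, uses plain Python tuples instead of tensors, and the helper name `batch_node_types` is hypothetical; it says nothing about what the sampler adds around those seeds.

```python
from typing import List, Sequence, Set, Tuple


def batch_node_types(flat_nodes: Sequence[Tuple[str, int]],
                     batch_size: int) -> List[Set[str]]:
    """Return the set of seed node types found in each sequential batch."""
    return [
        {node_type for node_type, _ in flat_nodes[start:start + batch_size]}
        for start in range(0, len(flat_nodes), batch_size)
    ]


# Four 'paper' seeds followed by two 'author' seeds, batched in pairs.
flat_nodes = [("paper", i) for i in range(4)] + [("author", i) for i in range(2)]
types_per_batch = batch_node_types(flat_nodes, batch_size=2)
print(types_per_batch)  # [{'paper'}, {'paper'}, {'author'}]
```

With this ordering, no batch mixes seed types, so a loss defined over several node types would see only one of them per step unless the list is shuffled.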

torch_geometric/loader/neighbor_loader.py

Co-authored-by: Jinu Sunil <jinu.sunil@gmail.com>

codecov · 2022-09-05T12:29:37Z

Codecov Report

Merging #5013 (6ce3cf0) into master (79617e0) will decrease coverage by 1.94%.
The diff coverage is 86.00%.

❗ Current head 6ce3cf0 differs from pull request most recent head 9af624f. Consider uploading reports for the commit 9af624f to get more accurate results

@@            Coverage Diff             @@
##           master    #5013      +/-   ##
==========================================
- Coverage   85.27%   83.32%   -1.95%     
==========================================
  Files         338      338              
  Lines       18683    18709      +26     
==========================================
- Hits        15931    15590     -341     
- Misses       2752     3119     +367

Impacted Files	Coverage Δ
torch_geometric/data/lightning_datamodule.py	`48.82% <ø> (ø)`
torch_geometric/loader/neighbor_loader.py	`92.22% <84.78%> (-2.55%)`	⬇️
torch_geometric/typing.py	`100.00% <100.00%> (ø)`
torch_geometric/nn/models/dimenet_utils.py	`0.00% <0.00%> (-75.52%)`	⬇️
torch_geometric/nn/models/dimenet.py	`14.51% <0.00%> (-53.00%)`	⬇️
torch_geometric/profile/profile.py	`37.89% <0.00%> (-26.32%)`	⬇️
torch_geometric/nn/conv/utils/typing.py	`81.25% <0.00%> (-17.50%)`	⬇️
torch_geometric/nn/inits.py	`67.85% <0.00%> (-7.15%)`	⬇️
torch_geometric/transforms/add_self_loops.py	`94.44% <0.00%> (-5.56%)`	⬇️
torch_geometric/nn/resolver.py	`88.88% <0.00%> (-5.56%)`	⬇️
... and 12 more


mananshah99

Thanks for the ping @Padarn. Left a few comments, happy to chat further as well.

mananshah99 · 2022-09-06T23:02:43Z

torch_geometric/loader/neighbor_loader.py

@@ -147,7 +152,8 @@ def _set_num_neighbors_and_num_hops(self, num_neighbors):
        # Add at least one element to the list to ensure `max` is well-defined
        self.num_hops = max([0] + [len(v) for v in num_neighbors.values()])

-    def _sparse_neighbor_sample(self, index: Tensor):
+    def _sparse_neighbor_sample(self, index: Union[List[int], Tensor]):


Why? Don't we convert to tensors beforehand?

Ah, because that made the logic in __call__ quite confusing. You would first have to check whether or not the input was hetero, and then handle the conversion separately if it was a mixed input. To me it seemed simpler to handle these cases in the individual functions, as it was actually less code.

torch_geometric/loader/neighbor_loader.py

mananshah99 · 2022-09-06T23:06:26Z

torch_geometric/loader/neighbor_loader.py

+            batch_sizes = {
+                node_type: index.numel()
+                for node_type, index in index_dict.items()
+            }

-        if self.data_cls != 'custom' and issubclass(self.data_cls, Data):
-            return self._sparse_neighbor_sample(index) + (index.numel(), )
+            return self._hetero_sparse_neighbor_sample(index_dict) + (
+                batch_sizes, )


This makes sense, and you are right that it somewhat conflicts with the interface in #5312. I think this is totally okay; that interface will likely change significantly over the coming week or two (we also need to support link-level neighbor sampling, etc.).

If it is alright with you, I would propose first merging #5312, and then adapting that interface as part of this PR to support returning a dict of batch sizes for each node type. Wdyt?

Agreed! I see it's merged now, so let me rethink this a little based on what you've added.
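As a rough illustration of the "dict of batch sizes for each node type" idea, the sketch below groups one flat batch by node type and counts the seeds per type. It is an assumption-laden stand-in: the helper `group_by_node_type` is hypothetical, and `len` on plain lists replaces `index.numel()` on tensors from the diff above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_by_node_type(
        batch: Sequence[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect the seed indices of a flat batch under their node type."""
    index_dict: Dict[str, List[int]] = defaultdict(list)
    for node_type, i in batch:
        index_dict[node_type].append(i)
    return dict(index_dict)


batch = [("paper", 2), ("author", 0), ("author", 1)]
index_dict = group_by_node_type(batch)

# Per-type batch sizes, mirroring the dict comprehension in the diff.
batch_sizes = {node_type: len(index) for node_type, index in index_dict.items()}
print(batch_sizes)  # {'paper': 1, 'author': 2}
```

A node type with no seeds in the batch is simply absent from both dicts in this sketch; whether the real interface should return an explicit zero instead is left open.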

mananshah99 · 2022-09-06T23:07:49Z

torch_geometric/loader/neighbor_loader.py

+def get_mixed_sampling_nodes(data: HeteroData,
+                             input_nodes: List[InputNodes]) -> SamplingNodes:


Document or perhaps rename? mixed isn't super clear to me (this is just getting sampling nodes for different node types, right?)

Yes, good point, and I think you're right that we could merge these two functions. I originally split them because one was being used in the sampler init, but I've removed this. Let me clean it up.

mananshah99 · 2022-09-06T23:08:07Z

torch_geometric/loader/neighbor_loader.py

+def get_node_types(input_nodes: List[Tuple[str, Tensor]]) -> List[str]:
+    return [node_type for node_type, index in input_nodes]
+
+
+def get_node_list(input_nodes: List[Tuple[str, Tensor]]) -> HeteroNodeList:
+    return [(node_type, i) for node_type, index in input_nodes for i in index]


Do these need to be separate functions?

👍 good point
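One possible shape for a merged helper is sketched below. It is only an illustration of the suggestion: the name `split_input_nodes` is hypothetical, and plain lists stand in for tensors.

```python
from typing import List, Sequence, Tuple


def split_input_nodes(
    input_nodes: Sequence[Tuple[str, Sequence[int]]]
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return the node types and the flat (node_type, index) list in one pass."""
    node_types: List[str] = []
    node_list: List[Tuple[str, int]] = []
    for node_type, index in input_nodes:
        node_types.append(node_type)
        node_list.extend((node_type, int(i)) for i in index)
    return node_types, node_list


node_types, node_list = split_input_nodes([("paper", [0, 1]), ("author", [5])])
print(node_types)  # ['paper', 'author']
print(node_list)   # [('paper', 0), ('paper', 1), ('author', 5)]
```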

Padarn · 2022-09-10T01:46:11Z

Thanks for the comments @mananshah99. Sorry I've been busy with my day job, will review and address over the weekend.

Co-authored-by: Manan Shah <manan.shah.777@gmail.com>

Padarn · 2022-09-10T02:05:23Z

On second look, I think I'll wait for you to finish your current refactoring PRs. The code has changed a lot and I'll have to fit into the new interface. I will focus on helping review your PRs first and then rework this one.

mananshah99 · 2022-09-20T22:49:57Z

On second look, I think I'll wait for you to finish your current refactoring PRs. The code has changed a lot and I'll have to fit into the new interface. I will focus on helping review your PRs first and then rework this one.

Thank you for accommodating :) The refactoring PRs are complete now, and the interface is mostly stable. Happy to help move this implementation over behind the new interface, it's pretty cool.

Padarn · 2022-09-21T00:11:57Z

Great! I can refactor later this week. I'll probably start a new PR, as I think most of the code needs to move; I will definitely ask you for a review.

Padarn mentioned this pull request Jul 20, 2022

[Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends #4765

Open

8 tasks

rusty1s assigned Padarn Jul 21, 2022

rusty1s added feature 1 - Priority P1 loader labels Jul 21, 2022

Padarn force-pushed the padarn/neighbour-multi-node branch from 2d173e5 to 43b78e8 Compare July 24, 2022 01:22

Padarn force-pushed the padarn/neighbour-multi-node branch from 43b78e8 to 63b70b3 Compare August 9, 2022 12:42

Padarn commented Aug 9, 2022

Padarn changed the title ~~[WIP] Support multiple node type sampling in NeighborLoader~~ Support multiple node type sampling in NeighborLoader Aug 14, 2022

Padarn requested review from rusty1s and wsad1 August 14, 2022 04:15

Padarn force-pushed the padarn/neighbour-multi-node branch from af2eec9 to 043ec22 Compare August 14, 2022 06:44

Padarn commented Aug 15, 2022

wsad1 reviewed Aug 17, 2022

wsad1 reviewed Sep 5, 2022

Padarn added 13 commits September 5, 2022 20:23

wip nodelist type

0b7552e

implement multi-type

15d268e

refactor

470927c

add to docstring

d94f642

add changelog

08ff1fc

add changelog

c606a28

add test, support None

a85ab7a

add batch size per type

8b49bdc

move convert

e5b616b

fix for feature store

f73a0cc

fix tests

bfaa27f

change to allclose

f2e4e9b

change to allclose

0582dd0

Padarn and others added 3 commits September 5, 2022 20:24

update mixed node tests

92ee5bd

Update torch_geometric/loader/neighbor_loader.py

57fdb1e

Co-authored-by: Jinu Sunil <jinu.sunil@gmail.com>

fix naming from suggestion

6ce3cf0

Padarn force-pushed the padarn/neighbour-multi-node branch from 67a3a54 to 6ce3cf0 Compare September 5, 2022 12:24

mananshah99 reviewed Sep 6, 2022

Update torch_geometric/loader/neighbor_loader.py

9af624f

Co-authored-by: Manan Shah <manan.shah.777@gmail.com>

Padarn closed this Sep 24, 2022
