[Roadmap] Support multiple node/edge type sampling using `NeighborLoader` and friends #4765

Padarn · 2022-06-05T01:46:19Z

🚀 The feature, motivation and pitch

The goal of this roadmap is to understand how we might support sampling across multiple node or edge types in NeighborLoader and LinkNeighborLoader. As an example, currently, the constructor of NeighborLoader takes a single NodeType with an optional index. For example

NeighborLoader(
    data=data, 
    input_nodes=('paper', index)
)

However, a user (for example #4707) may reasonably want to instead sample from multiple edge types, so we would like to propose a way to support this directly. Thus we would like to support this for both NeighborLoader and LinkNeighborLoader.

Interface

A proposed interface for this could be to extend the supported input types to include a tuple of tuples for the sampling index. This would probably mean also supporting different negative sampling ratios. An example of the new interface would be:

NeighborLoader(
    data=data, 
    input_nodes=(('paper', index), ('author', index))
)

The output of this for NeighborLoader could remain the same, but the sampling would be extended to cover the different input types.

For the case of LinkNeighborLoader, the interface change could be similar

LinkNeighborLoader(
    data=data, 
    edge_label_index =(
        (('paper', 'author'), index), 
        (('author', 'paper'), index))
)

For the LinkNeighborLoader we store dictionaries data[type].edge_label and data[type].edge_label_index to allow users to keep track of which edges the sampled edges correspond to

data['edge_type'].edge_label
data['edge_type'].edge_label_index

This is easy to extend to multiple types. But we may want to implement a simple way to find out which edge types were sampled in a given minibatch.

data.get_sampled_edge_types

Implementation

To implement this we need to decide whether sampling happens randomly across all types in a batch, or whether the sampling is done for each type independently. To keep it simple for now, we would keep a single batch_size and assume the user wants to sample randomly from all of the types/indexes provided, so for example

NeighborLoader(
    data=data, 
    batch_size=N
    input_nodes=(('paper', index), ('author', index2))
)

would return a random sample of N nodes from union([index, index2]).

Risks

This adds complication in usage, as to effectively predict on a mini-batch you would need to compute the heads for each node/edge type. This needs to be explored through some examples and help from the community.

Roadmap

NeighborLoader.
- Support multiple node types in NeighborLoader. Support multiple node type sampling in NeighborLoader #5013
- Examples of multiple node types in NeighborLoader.
LinkNeighborLoader.
- Support multiple edge types in LinkNeighborLoader.
- Examples of multiple edges types in LinkNeighborLoader.
- Support negative_sampling_ratio in LinkNeighborLoader for different edge types.
Benchmark converting list of tuples to dictionary as suggested in comment: [Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends #4765 (comment)

Any feedback and help from the community is highly appreciated!

cc: @rusty1s

The text was updated successfully, but these errors were encountered:

rusty1s · 2022-06-06T06:45:14Z

Thanks @Padarn.

To avoid confusion with the existing implementation, when using the new functionality the properties data.edge_label and data.edge_label_index would instead be stored on the edge type themselves:

Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.

In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?

One challenge (which may reduce runtime) is to define what we pass to the parent constructor in super().__init__(). I think we would need to pass in a list of tuples now, e.g.,

super().__init__([("paper, 0), ("paper, 1), ...], ...)

and convert this back to dictionaries of node indices. This may be expensive to compute since we are operating on Python primitives here, so I would appreciate if we could do some benchmarking first. Let me know your thoughts!

Padarn · 2022-06-06T08:23:39Z

In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?

Thanks, fixed.

Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.

Sorry I mistakenly remembered that we stored the index and labels without edge type. I've updated this to make it clear.

Padarn · 2022-06-06T08:24:06Z

so I would appreciate if we could do some benchmarking first. Let me know your thoughts!

Very good point. Will add an item for this.

Padarn · 2022-06-25T13:48:23Z

I ran some benchmarks using the new benchmarks/loader/neighbor_loader.py and the code changes in https://github.com/Padarn/pytorch_geometric/tree/padarn/neigbour-benchmark - I didn't try to optimize at all yet, just implement exactly what was suggested: Converting to and from dict representation

super().__init__([("paper, 0), ("paper, 1), ...], ...

Here are some results.

Old Version

Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.165, 3.033, 3.074], average runtime=3.091
batch size=8192, iterations=270, runtimes=[3.783, 3.555, 3.67], average runtime=3.669
batch size=4096, iterations=540, runtimes=[4.04, 3.893, 3.879], average runtime=3.937

New Version

Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.126, 3.094, 3.067], average runtime=3.096
batch size=4096, iterations=540, runtimes=[4.18, 4.061, 4.003], average runtime=4.081
batch size=2048, iterations=1080, runtimes=[4.275, 4.391, 4.329], average runtime=4.332

So its definitely not free.

Padarn · 2022-06-25T13:51:12Z

I guess the obvious way to do this would be to create a flat graph, like

input_nodes = {'paper':[0,1,2], 'author':[0,1,2]}

becomes

input_nodes = [0, 1, 2, 0+num_paper_nodes, 1+num_paper_nodes,2+num_paper_nodes]

The offsets could be added as a property to the NeighborSampler.

rusty1s · 2022-06-26T08:41:03Z

Thanks for the benchmarks. They don't look that worrying :) Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.

I think your proposed solution of a flat graph won't necessarily work as you will still need to re-construct the the node type for each index in a later step.

Padarn · 2022-06-26T08:57:53Z

I think your proposed solution of a flat graph won't necessarily work as you will still need to re-construct the the node type for each index in a later step.

True. I was thinking it might be quicker as it could probably be done in tensor operations, but I see that might not actually be any better.

Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.

Agreed!

Padarn added feature 1 - Priority P1 loader labels Jun 5, 2022

Padarn self-assigned this Jun 5, 2022

Padarn mentioned this issue Jul 20, 2022

Support multiple node type sampling in NeighborLoader #5013

Closed

2 tasks

rusty1s added roadmap and removed feature labels Aug 8, 2022

rusty1s changed the title ~~Support multiple node/edge type sampling using NeighborLoader and friends~~ [Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends Aug 8, 2022

Padarn mentioned this issue Sep 24, 2022

Support multiple node type sampling in NeighborLoader (V2) #5521

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Roadmap] Support multiple node/edge type sampling using `NeighborLoader` and friends #4765

[Roadmap] Support multiple node/edge type sampling using `NeighborLoader` and friends #4765

Padarn commented Jun 5, 2022 •

edited

Loading

rusty1s commented Jun 6, 2022 •

edited

Loading

Padarn commented Jun 6, 2022

Padarn commented Jun 6, 2022

Padarn commented Jun 25, 2022

Padarn commented Jun 25, 2022 •

edited

Loading

rusty1s commented Jun 26, 2022

Padarn commented Jun 26, 2022

[Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends #4765

[Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends #4765

Comments

Padarn commented Jun 5, 2022 • edited Loading

🚀 The feature, motivation and pitch

Interface

Implementation

Risks

Roadmap

rusty1s commented Jun 6, 2022 • edited Loading

Padarn commented Jun 6, 2022

Padarn commented Jun 6, 2022

Padarn commented Jun 25, 2022

Padarn commented Jun 25, 2022 • edited Loading

rusty1s commented Jun 26, 2022

Padarn commented Jun 26, 2022

[Roadmap] Support multiple node/edge type sampling using `NeighborLoader` and friends #4765

[Roadmap] Support multiple node/edge type sampling using `NeighborLoader` and friends #4765

Padarn commented Jun 5, 2022 •

edited

Loading

rusty1s commented Jun 6, 2022 •

edited

Loading

Padarn commented Jun 25, 2022 •

edited

Loading