Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends #4765

Open
1 of 8 tasks
Padarn opened this issue Jun 5, 2022 · 7 comments
Open
1 of 8 tasks

Comments

@Padarn
Copy link
Contributor

Padarn commented Jun 5, 2022

🚀 The feature, motivation and pitch

The goal of this roadmap is to understand how we might support sampling across multiple node or edge types in NeighborLoader and LinkNeighborLoader. As an example, currently, the constructor of NeighborLoader takes a single NodeType with an optional index. For example

NeighborLoader(
    data=data, 
    input_nodes=('paper', index)
)

However, a user (for example #4707) may reasonably want to instead sample from multiple edge types, so we would like to propose a way to support this directly. Thus we would like to support this for both NeighborLoader and LinkNeighborLoader.

Interface

A proposed interface for this could be to extend the supported input types to include a tuple of tuples for the sampling index. This would probably mean also supporting different negative sampling ratios. An example of the new interface would be:

NeighborLoader(
    data=data, 
    input_nodes=(('paper', index), ('author', index))
)

The output of this for NeighborLoader could remain the same, but the sampling would be extended to cover the different input types.

For the case of LinkNeighborLoader, the interface change could be similar

LinkNeighborLoader(
    data=data, 
    edge_label_index =(
        (('paper', 'author'), index), 
        (('author', 'paper'), index))
)

For the LinkNeighborLoader we store dictionaries data[type].edge_label and data[type].edge_label_index to allow users to keep track of which edges the sampled edges correspond to

data['edge_type'].edge_label
data['edge_type'].edge_label_index

This is easy to extend to multiple types. But we may want to implement a simple way to find out which edge types were sampled in a given minibatch.

data.get_sampled_edge_types

Implementation

To implement this we need to decide whether sampling happens randomly across all types in a batch, or whether the sampling is done for each type independently. To keep it simple for now, we would keep a single batch_size and assume the user wants to sample randomly from all of the types/indexes provided, so for example

NeighborLoader(
    data=data, 
    batch_size=N
    input_nodes=(('paper', index), ('author', index2))
)

would return a random sample of N nodes from union([index, index2]).

Risks

This adds complication in usage, as to effectively predict on a mini-batch you would need to compute the heads for each node/edge type. This needs to be explored through some examples and help from the community.

Roadmap

Any feedback and help from the community is highly appreciated!

cc: @rusty1s

@rusty1s
Copy link
Member

rusty1s commented Jun 6, 2022

Thanks @Padarn.

To avoid confusion with the existing implementation, when using the new functionality the properties data.edge_label and data.edge_label_index would instead be stored on the edge type themselves:

Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.

In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?

One challenge (which may reduce runtime) is to define what we pass to the parent constructor in super().__init__(). I think we would need to pass in a list of tuples now, e.g.,

super().__init__([("paper, 0), ("paper, 1), ...], ...)

and convert this back to dictionaries of node indices. This may be expensive to compute since we are operating on Python primitives here, so I would appreciate if we could do some benchmarking first. Let me know your thoughts!

@Padarn
Copy link
Contributor Author

Padarn commented Jun 6, 2022

In addition, I am not sure what "Support multiple node types in LinkNeighborLoader" means. Do you mean edge types instead?

Thanks, fixed.

Can you clarify? I think just lifting the current interface to support lists of types as input is totally fine.

Sorry I mistakenly remembered that we stored the index and labels without edge type. I've updated this to make it clear.

@Padarn
Copy link
Contributor Author

Padarn commented Jun 6, 2022

so I would appreciate if we could do some benchmarking first. Let me know your thoughts!

Very good point. Will add an item for this.

@Padarn
Copy link
Contributor Author

Padarn commented Jun 25, 2022

I ran some benchmarks using the new benchmarks/loader/neighbor_loader.py and the code changes in https://github.com/Padarn/pytorch_geometric/tree/padarn/neigbour-benchmark - I didn't try to optimize at all yet, just implement exactly what was suggested: Converting to and from dict representation

super().__init__([("paper, 0), ("paper, 1), ...], ...

Here are some results.

Old Version

Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.165, 3.033, 3.074], average runtime=3.091
batch size=8192, iterations=270, runtimes=[3.783, 3.555, 3.67], average runtime=3.669
batch size=4096, iterations=540, runtimes=[4.04, 3.893, 3.879], average runtime=3.937

New Version

Evaluation sampling with all neighbors
batch size=16384, iterations=135, runtimes=[3.126, 3.094, 3.067], average runtime=3.096
batch size=4096, iterations=540, runtimes=[4.18, 4.061, 4.003], average runtime=4.081
batch size=2048, iterations=1080, runtimes=[4.275, 4.391, 4.329], average runtime=4.332

So its definitely not free.

@Padarn
Copy link
Contributor Author

Padarn commented Jun 25, 2022

I guess the obvious way to do this would be to create a flat graph, like

input_nodes = {'paper':[0,1,2], 'author':[0,1,2]} 

becomes

input_nodes = [0, 1, 2, 0+num_paper_nodes, 1+num_paper_nodes,2+num_paper_nodes]

The offsets could be added as a property to the NeighborSampler.

@rusty1s
Copy link
Member

rusty1s commented Jun 26, 2022

Thanks for the benchmarks. They don't look that worrying :) Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.

I think your proposed solution of a flat graph won't necessarily work as you will still need to re-construct the the node type for each index in a later step.

@Padarn
Copy link
Contributor Author

Padarn commented Jun 26, 2022

I think your proposed solution of a flat graph won't necessarily work as you will still need to re-construct the the node type for each index in a later step.

True. I was thinking it might be quicker as it could probably be done in tensor operations, but I see that might not actually be any better.

Ideally, we could use the current scheme when only one node type is used, and use the slower variant whenever this is not the case.

Agreed!

@rusty1s rusty1s added roadmap and removed feature labels Aug 8, 2022
@rusty1s rusty1s changed the title Support multiple node/edge type sampling using NeighborLoader and friends [Roadmap] Support multiple node/edge type sampling using NeighborLoader and friends Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants