Neighborhood Sampling #64

Closed
tbright17 opened this issue Dec 27, 2018 · 47 comments

@tbright17

Hi Matthias,

I wrote my own dataset and dataloader, and I use adjacency matrices instead of edge_index. When I tried to convert adj_matrix to edge_index, I got confused because I have multiple graphs (multiple samples, possibly with different numbers of nodes) in one batch. I went over some of the examples and found that most of them use batch_size 1. How should I prepare the edge_index in a mini-batch setting? I can easily use DenseSAGEConv, but I want to try other networks.

Thanks,
Ming

@rusty1s
Member

rusty1s commented Dec 27, 2018

Hi,

If I understand you correctly, you are saving and loading dense adjacency matrices and want to convert them to a sparse layout. In general, this approach is not recommended because of the huge memory overhead of stacking dense adjacency matrices block-wise. If you want to convert your dense matrices to sparse matrices, you can make use of the nonzero method from PyTorch:

edge_index = adj.nonzero().t().contiguous()
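
For illustration, here is a dependency-free sketch of the layout that conversion produces: two rows of indices (sources and targets), one column per nonzero entry, in row-major order. The helper name `dense_to_edge_index` is made up for this example and is not part of PyTorch or PyTorch Geometric.

```python
def dense_to_edge_index(adj):
    """Return [sources, targets] for every nonzero entry of a nested-list matrix."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


# Toy 3-node path graph 0 - 1 - 2 (hypothetical data).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
edge_index = dense_to_edge_index(adj)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```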

@tbright17
Author

Thanks for the reply. I also want to do mini-batch training so each batch has several graphs. Can I achieve that?

Thanks

@rusty1s
Member

rusty1s commented Dec 27, 2018

This can be automatically achieved when using the torch_geometric.data.DataLoader.
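
A rough picture of what such batching does: the graphs of a mini-batch are merged into one large disconnected graph, the node indices of each graph are shifted by the number of nodes that came before it, and a batch vector records which graph each node belongs to. The sketch below shows only that idea in plain Python; `collate_graphs` is a hypothetical name, not the library's implementation.

```python
def collate_graphs(graphs):
    """Merge (num_nodes, edge_index) pairs into one disconnected graph.

    Returns the shifted edge_index and a batch vector mapping node -> graph id.
    """
    sources, targets, batch = [], [], []
    offset = 0
    for graph_id, (num_nodes, edge_index) in enumerate(graphs):
        sources += [s + offset for s in edge_index[0]]
        targets += [t + offset for t in edge_index[1]]
        batch += [graph_id] * num_nodes
        offset += num_nodes  # the next graph's nodes start after this one's
    return [sources, targets], batch


g1 = (3, [[0, 1], [1, 2]])  # 3 nodes, edges 0->1 and 1->2
g2 = (2, [[0, 1], [1, 0]])  # 2 nodes, edges 0->1 and 1->0
edge_index, batch = collate_graphs([g1, g2])
print(edge_index)  # [[0, 1, 3, 4], [1, 2, 4, 3]]
print(batch)       # [0, 0, 0, 1, 1]
```

Because the merged graph has no edges between the individual graphs, message passing over it treats every sample independently even when the graphs differ in size.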

@tbright17
Author

Thanks. I will check this.

@zeneofa

zeneofa commented Dec 27, 2018

Hi,

Thanks again for the awesome package. I have a few questions related to this, so I did not want to create a separate issue:

  1. From the examples it appears that mini-batching is done by graph only? Is it possible to do this by node, or edge? (I only have one large graph)
  2. For SplineConv, is the entire edge_index, edge_attr and node_set provided for each forward pass?

Background: I have a large dense graph with ~260000 nodes, with approximately 8 node features. From which I am trying to train a node classification model (binary classes), with the help of a 3D manifold (this forms my pseudo-path/co-ordinates). Edges are independent of the manifold itself, though their attributes are derived from it.

@rusty1s
Member

rusty1s commented Dec 27, 2018

Hi,
Yes, this is a very important subject and something that is currently not supported. This would probably be done by a new dataloader which samples nodes from the graph and builds a k-hop subgraph around each node (where k corresponds to the number of layers). I will try to add this to the package.
There are already a bunch of papers on this topic, so if you need something specific, please point me to the respective papers.
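
To make the idea concrete, here is a minimal sketch of sampling a k-hop neighborhood around a set of seed nodes, with at most `size` neighbors drawn per node and hop. It is plain Python over an adjacency list; the function name, its parameters, and the toy graph are all assumptions for illustration and do not describe any existing loader.

```python
import random


def sample_k_hop(adj_list, seeds, num_hops, size, rng):
    """Return one node list per hop; layers[0] holds the seeds."""
    layers = [sorted(set(seeds))]
    for _ in range(num_hops):
        # Keep the previous nodes so each layer contains the one before it.
        frontier = set(layers[-1])
        for node in layers[-1]:
            neighbors = adj_list[node]
            k = min(size, len(neighbors))
            frontier.update(rng.sample(neighbors, k))
        layers.append(sorted(frontier))
    return layers


# Toy undirected graph as an adjacency list (hypothetical data).
adj_list = {0: [1, 2, 3], 1: [0, 4], 2: [0, 4, 5], 3: [0], 4: [1, 2], 5: [2]}
layers = sample_k_hop(adj_list, seeds=[0], num_hops=2, size=2, rng=random.Random(0))
print("<-".join(str(len(layer)) for layer in layers))
```

Capping `size` bounds the growth per hop, which is the point of sampling: on a dense graph the full k-hop neighborhood quickly covers most of the nodes.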

@zeneofa

zeneofa commented Dec 28, 2018

Hi,

I have looked into the L-hop strategy, but with the density of my networks, this will effectively be the entire network with only a few layers.

There is the paper (https://arxiv.org/pdf/1710.10568.pdf), with TensorFlow code (https://github.com/thu-ml/stochastic_gcn), that looks promising. The code is not documented and has only a few comments, which makes it quite hard to parse (that, and I have not used TensorFlow at all). The idea, however, seems quite straightforward. I am not sure how to integrate this with SplineConv, though.

rusty1s changed the title from "how to batch edge_index" to "Neighborhood Sampling" on Mar 13, 2019

@duncster94

Might be worth looking into this as well:
https://arxiv.org/pdf/1801.10247.pdf

They report significant speedups over the Kipf Welling GCN and GraphSAGE with comparable performance.

@familyld

Expecting to see the mini-batching support for a single big graph.

@fkhawar

fkhawar commented May 22, 2019

Hi, any progress on this, or any workaround in the meantime?

@rusty1s
Member

rusty1s commented May 23, 2019

Been working on it :)

@rusty1s
Member

rusty1s commented May 31, 2019

I added a first version of NeighborSampler to PyTorch Geometric. It is still undocumented and unfinished (e.g. there is currently no support for num_workers and node probabilities). You can find a training example on the giant Reddit graph in the examples/ directory here. PyTorch Cluster needs to be upgraded to v1.4.0 in order to use it.

I would be very happy to discuss the API here and get feedback from you. Currently, you can iterate the loader and access a batch-wise DataFlow object, which defines a computation flow up to num_hops+1 layers. You can print it to see how many nodes are being accessed in each layer, e.g.:

DataFlow(1000<-20000<-60000)

Each block in DataFlow defines a bipartite graph between intermediate layers. Starting from the root nodes, you can propagate messages to the final nodes.
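
As a simplified picture of such a computation flow, the sketch below builds one bipartite block per pair of consecutive layers and mean-aggregates scalar features from the outermost sampled nodes toward the seed node. The helper names, the toy graph, and the choice of mean aggregation are assumptions for illustration; this does not reproduce the DataFlow class.

```python
def build_block(adj_list, target_nodes, source_nodes):
    """Bipartite block: (source, target) edges in layer-local indices."""
    local = {node: i for i, node in enumerate(source_nodes)}
    edges = []
    for t_idx, target in enumerate(target_nodes):
        for neighbor in adj_list[target]:
            if neighbor in local:
                edges.append((local[neighbor], t_idx))
    return edges


def propagate(edges, x_source, num_targets):
    """Mean-aggregate scalar source features into each target node."""
    total = [0.0] * num_targets
    count = [0] * num_targets
    for s_idx, t_idx in edges:
        total[t_idx] += x_source[s_idx]
        count[t_idx] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


adj_list = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
layers = [[0], [1, 2], [0, 3]]  # seed <- 1-hop <- 2-hop node sets
features = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}

x = [features[n] for n in layers[-1]]  # start at the outermost layer
for target_nodes, source_nodes in zip(reversed(layers[:-1]), reversed(layers[1:])):
    edges = build_block(adj_list, target_nodes, source_nodes)
    x = propagate(edges, x, len(target_nodes))
print(x)  # [4.5]
```

Each block only touches the nodes of two adjacent layers, so the memory needed per step depends on the sampled layer sizes rather than on the size of the whole graph.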

@duncster94

@rusty1s Thanks for the update. I'm having trouble getting torch-cluster v1.4.0 installed. v1.2.1 installation gives me no problems.

Here's the error:

cpu/graclus.cpp: In lambda function:
    cpu/graclus.cpp:52:43: error: invalid initialization of reference of type ‘const at::Type&’
from expression of type ‘c10::ScalarType’
       AT_DISPATCH_ALL_TYPES(weight.scalar_type(), "weighted_graclus", [&] {
                                               ^
    /scratch/gobi1/forsterd/pytorchEnv/lib/python3.7/site-packages/torch/lib/include/ATen/Dispatch.h:70:32: note: in definition of macro ‘AT_DISPATCH_ALL_TYPES’
         const at::Type& the_type = TYPE;                                         \
                                    ^
    error: command 'gcc' failed with exit status 1

Any ideas?

@rusty1s
Member

rusty1s commented Jun 1, 2019

Yes, you need to update to PyTorch 1.1. Sorry :(

@duncster94

@rusty1s That solved it, thanks!

@mainak124

@rusty1s Thanks very much for adding the NeighborSampler. If I understood correctly, it supports node iterator which is helpful for supervised task. For unsupervised task, however, I guess having an edge iterator would be necessary. For now, I am just trying to reproduce the result (unsupervised setting) from GraphSAGE paper. Is there any plan of adding that support in future? Thanks a lot!

@rusty1s
Member

rusty1s commented Jun 26, 2019

Can you elaborate a bit more? I am not sure if I fully understand. For GraphSAGE unsupervised learning, you need negative sampling and a sampling scheme for near neighbors. Negative sampling is trivial via the use of another randomly shuffling NeighborSampler. For sampling of neighbors, you need to sample node indices from a random walk window and input those to the NeighborSampler for producing a corresponding DataFlow object, e.g.:

sampler = NeighborSampler(..., shuffle=True)
negative_sampler = NeighborSampler(..., shuffle=True)

for data_flow, negative_data_flow in zip(sampler, negative_sampler):
    n_id = data_flow.n_id  # Get node indices
    rw_sampled_n_id = ...  # Sample node indices from random walks starting from `n_id`   
    # Get another `data_flow` object to the sampled node indices 
    neighboring_data_flow = sampler.__produce__(rw_sampled_n_id)

    # Compute embeddings for each `data_flow` object
    z = model(data_flow)
    negative_z = model(negative_data_flow)
    neighboring_z = model(neighboring_data_flow)
    ...

    # Compute loss
    ...

For random walk sampling, we provide a GPU only and still undocumented functionality in torch-cluster. WDYT?
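
The random-walk step can also be sketched on the CPU in plain Python, which may help when prototyping. The function below runs one short walk per start node and returns a node visited along it as the "near neighbor". The name `random_walk_positives`, the adjacency-list input, and the fallback for isolated nodes are assumptions of this sketch, not the torch-cluster functionality mentioned above.

```python
import random


def random_walk_positives(adj_list, start_nodes, walk_length, rng):
    """For each start node, run one random walk and pick a node visited on it."""
    positives = []
    for node in start_nodes:
        current = node
        visited = []
        for _ in range(walk_length):
            neighbors = adj_list[current]
            if not neighbors:  # dead end: stop the walk early
                break
            current = rng.choice(neighbors)
            visited.append(current)
        # Fall back to the node itself if the walk could not move at all.
        positives.append(rng.choice(visited) if visited else node)
    return positives


# Toy graph: a path 0 - 1 - 2 plus an isolated node 3 (hypothetical data).
adj_list = {0: [1], 1: [0, 2], 2: [1], 3: []}
positives = random_walk_positives(adj_list, [0, 1, 3], walk_length=3, rng=random.Random(0))
print(positives)
```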

@mainak124

@rusty1s I was thinking of iterating through all the edges in the graph instead of iterating through all the nodes in one epoch. But your approach is much cleaner! Thank you so much! :)

@duncster94

Hey @rusty1s quick question: does data flow computation necessarily happen on the CPU? I.e. is it always necessary to send the data flow object to the device you're using after it's computed on the CPU?

@rusty1s
Member

rusty1s commented Jul 4, 2019

Yes, I am not sure if GPUs could bring any speed ups here. In the end it would require the whole graph on the GPU, something we want to prevent with this approach.

@JensTheDude

@rusty1s Thank you for the graphsage implementation. Are there any plans to implement a sampler for FastGCN as well?

@rusty1s
Member

rusty1s commented Jul 11, 2019

Yes :)

@ankitjain451

In unsupervised GraphSAGE, the original code iterates over edges instead of nodes and computes the reconstruction loss on sampled edges. Do you have plans to support that? I am not sure how iterating over nodes will help in the unsupervised case.

@rusty1s
Member

rusty1s commented Aug 3, 2019

I am not sure I fully understand why you can not implement the unsupervised GraphSAGE loss with the current neighbor sampling API, see #64 (comment). Can you elaborate?

@ankitjain451

ankitjain451 commented Aug 5, 2019

I looked into it and yes you can. I have two points:

  • In your code above, it is not really clear to me how you would do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?

  • Also, can you please guide me more on how to structure the loss and validation for unsupervised learning? I want to use a reconstruction loss (reconstructing edges based on embeddings). Currently, in TensorFlow, I am batching by edges, which helps me remove a few edges from the training dataset and use a max-margin loss. I am not sure if that provision is built into your code right now.

I am fairly new to PyTorch, so please excuse me if these are simple questions.

@rusty1s
Member

rusty1s commented Aug 6, 2019

> In your code above, it is not really clear to me how you would do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?

You certainly could use more sophisticated negative sampling strategies. A basic strategy to achieve this would be to just resample nodes which already occur in the sampled neighborhood. In practice, this is negligible IMO.
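
One way to read the basic strategy is as rejection sampling: draw node ids uniformly and redraw any id that already occurs in the sampled neighborhood. The sketch below assumes that reading; `sample_negatives`, its parameters, and the `max_tries` cap are hypothetical and not part of the library.

```python
import random


def sample_negatives(num_nodes, forbidden, num_samples, rng, max_tries=100):
    """Draw node ids uniformly, redrawing any that fall in `forbidden`."""
    forbidden = set(forbidden)
    negatives = []
    for _ in range(num_samples):
        candidate = rng.randrange(num_nodes)
        tries = 0
        while candidate in forbidden and tries < max_tries:
            candidate = rng.randrange(num_nodes)
            tries += 1
        # If max_tries ran out, the candidate may still be a forbidden node.
        negatives.append(candidate)
    return negatives


neighborhood = {0, 1, 2}  # nodes already present in the sampled neighborhood
negatives = sample_negatives(10, neighborhood, num_samples=5, rng=random.Random(0))
print(negatives)
```

On a large sparse graph a uniformly drawn node rarely lands in the neighborhood, which is consistent with the remark that skipping this check is negligible in practice.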

> Also, can you please guide me more on how to structure the loss and validation for unsupervised learning? I want to use a reconstruction loss (reconstructing edges based on embeddings). Currently, in TensorFlow, I am batching by edges, which helps me remove a few edges from the training dataset and use a max-margin loss. I am not sure if that provision is built into your code right now.

Batching by edges is also a good idea to implement the GraphSAGE loss. Given that you have sampled positive edges and sampled negative edges, you can compute the source and target node embeddings using the NeighborSampler:

pos_edge_index_batch = ...
neg_edge_index_batch = ...

sampler = NeighborSampler(..., shuffle=False)

for z_u_data_flow, z_v_data_flow, z_vn_data_flow in zip(sampler(pos_edge_index_batch[0]),
                                                        sampler(pos_edge_index_batch[1]),
                                                        sampler(neg_edge_index_batch[1])):
    z_u = model(z_u_data_flow)
    z_v = model(z_v_data_flow)
    z_vn = model(z_vn_data_flow)
    # Compute loss based on Eq. (1) of https://arxiv.org/pdf/1706.02216.pdf
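
For reference, Eq. (1) of the GraphSAGE paper has the form J(z_u) = -log σ(z_u · z_v) - Q · E[log σ(-z_u · z_vn)], where Q is the number of negative samples. A minimal dependency-free sketch of that loss for a single node follows; it works on plain Python lists rather than tensors, estimates the expectation by the mean over the given negatives, and the function names are made up for this example.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def unsup_loss(z_u, z_v, z_negs, q=None):
    """-log s(z_u.z_v) - Q * mean over negatives of log s(-z_u.z_n)."""
    q = len(z_negs) if q is None else q  # default: Q = number of negatives given
    pos = -log_sigmoid(dot(z_u, z_v))
    neg = -q * sum(log_sigmoid(-dot(z_u, z_n)) for z_n in z_negs) / len(z_negs)
    return pos + neg


z_u = [1.0, 0.0]
good = unsup_loss(z_u, z_v=[2.0, 0.0], z_negs=[[-2.0, 0.0]])  # neighbor close, negative far
bad = unsup_loss(z_u, z_v=[-2.0, 0.0], z_negs=[[2.0, 0.0]])   # the other way round
print(good < bad)  # True
```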

@ankitjain451

Thanks. I think that makes sense. I am just trying to get my head around your library in order to implement edge-based batching. Can you guide me on where and how to implement this while keeping your code design? I can submit a pull request with that eventually.

@rusty1s
Member

rusty1s commented Aug 6, 2019

I think a simple example on how to implement the GraphSAGE unsupervised loss should be sufficient. Feel free to submit a PR :)

@wwliu555

@rusty1s Thanks for the update! I ran into a segmentation fault when running the example reddit.py:

Fatal Python error: Segmentation fault

Current thread 0x00007fe21326f740 (most recent call first):
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_cluster/sampler.py", line 13 in neighbor_sampler
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 154 in __produce__
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 199 in __call__
  File "reddit.py", line 68 in train
  File "reddit.py", line 89 in <module>
Segmentation fault

Any idea how I can fix it? Thanks in advance!

@rusty1s
Member

rusty1s commented Aug 19, 2019

Can you run the torch-cluster test suite to verify that everything works as expected?

@wwliu555

The test outputs said that my compiler (g++ 4.8.5) was incompatible. I upgraded it and it works fine now. Thanks:)

@wwliu555

Heyyy @rusty1s, I'm trying to make NNConv supported for NeighborSampler.

One quick question: how can I obtain edge_attr in each block? Should I add self-loops to the graph first and then use e_id to index them?

@rusty1s
Member

rusty1s commented Aug 21, 2019

Exactly, self-loops need to exist in advance in order to use e_id. In addition, you need to set add_self_loops=True so that self-loops are guaranteed to be sampled.

@raphaelsulzer

Is there any smart way to do inference with the neighborhood sampler?
I am using it for training on large graphs, but for inference I run out of memory very quickly.

I am thinking of something like sampling n (overlapping) regions from the graph and making sure that every node is sampled at least once. Is that currently possible?

Thanks!

@duncster94

@raphaelsulzer If you can fit all the neighbours of any single node in memory you can try this:

ns = NeighborSampler(data,
    size=1.0,
    num_hops=num_hops,
    batch_size=1,
    shuffle=False,
    add_self_loops=True)

This will sample each node and its num_hops neighbourhood so you can do a forward pass on each node one-by-one.

@raphaelsulzer

@duncster94 Thank you! That works!

rusty1s added a commit that referenced this issue Sep 2, 2021
rusty1s added a commit that referenced this issue Sep 3, 2021

* rename

* update tutorial

* typo

* address comments

Co-authored-by: Dong Wang <dongwang@yannis-air.lan>
Co-authored-by: Jan Eric Lenssen <janeric.lenssen@tu-dortmund.de>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>

* typo

* fix

* typo

* update

* fix

* fix

Co-authored-by: benedekrozemberczki <benedek.rozemberczki@gmail.com>
Co-authored-by: Rex Ying <rexying@stanford.edu>
Co-authored-by: Dongkwan Kim <todoaskit@gmail.com>
Co-authored-by: Markus <markus.zopf@outlook.com>
Co-authored-by: Jimmie <jimmiebtlr@gmail.com>
Co-authored-by: Jinu Sunil <jinu.sunil@gmail.com>
Co-authored-by: Moritz R Schäfer <moritz.schaefer@protonmail.com>
Co-authored-by: Jiaxuan <youjiaxuan@gmail.com>
Co-authored-by: PabloAMC <pmorenocf@alumnos.unex.es>
Co-authored-by: Moritz Blum <31183934+moritzblum@users.noreply.github.com>
Co-authored-by: fbragman <fbragman@users.noreply.github.com>
Co-authored-by: Christopher Lee <2824685+CCInc@users.noreply.github.com>
Co-authored-by: Tim Daubenschütz <tim@daubenschuetz.de>
Co-authored-by: Yue Zhao <yzhao062@gmail.com>
Co-authored-by: Dong Wang <dongw89@gmail.com>
Co-authored-by: Dong Wang <dongwang@yannis-air.lan>
Co-authored-by: Jan Eric Lenssen <janeric.lenssen@tu-dortmund.de>
@codexhammer

Hi @rusty1s,

I am trying to train and classify only a fraction of the total number of classes in a continual learning setting. For example, for the Cora dataset with 7 classes:

path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Cora')
data = Planetoid(path, "Cora", transform=transforms.NormalizeFeatures())[0]
_, _, classes, train_mask, val_mask, test_mask = data
classes_in_task = [0,1,2]

conditions = torch.BoolTensor( [l in classes_in_task for l in classes[1]] )
train_mask = (train_mask[0], train_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in **train** dataset
val_mask = (val_mask[0], val_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in **validation** dataset
test_mask = (test_mask[0], test_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in **test** dataset

train_task_nid = np.nonzero(train_mask[1])  # Select node_ids with classes in [0,1,2]
train_task_nid = torch.flatten(train_task_nid)

train_loader = NeighborLoader(
        data,
        num_neighbors=[30]*2,
        batch_size=2000,
        input_nodes=train_task_nid,
        )

I use the train_loader to train the nodes with only specific classes in multiple batches. But, training the network this way for only specific nodes isn't giving me good results. Is there any workaround for this?

Also, during testing, I want to evaluate all the test nodes as a single batch. How can I set this up with the NeighborLoader arguments?

@rusty1s
Member

rusty1s commented Jan 12, 2022

What do you mean by "it does not give good results"? Is it due to your learning setup or due to NeighborLoader?

During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.

@codexhammer

During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.

If I omit the usage of NeighborLoader, how can I select the input nodes with only the classes [0,1,2] for testing? Is there a function for only selecting a particular set of nodes? Thanks for the reply.

@rusty1s
Member

rusty1s commented Jan 12, 2022

After obtaining the model output, you can mask it to keep only the node embeddings you are interested in:

out = model(data.x, data.edge_index)
out_train = out[data.train_mask]
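The same masking idea can be narrowed to a subset of classes, as asked above. The following is only a minimal sketch: random logits and labels stand in for `model(data.x, data.edge_index)` and `data.y`, and the names `task_mask` and `classes_in_task` are assumptions made for illustration. Restricting the logits to the task's classes is one possible design choice, not a requirement. `torch.isin` needs PyTorch 1.10 or later.

```python
import torch

torch.manual_seed(0)

num_nodes, num_classes = 10, 7
# Stand-ins (assumptions): in practice, out = model(data.x, data.edge_index),
# y = data.y and test_mask = data.test_mask.
out = torch.randn(num_nodes, num_classes)
y = torch.tensor([0, 1, 2, 3, 4, 5, 6, 0, 1, 2])
test_mask = torch.tensor([1, 1, 0, 1, 0, 1, 1, 1, 0, 1], dtype=torch.bool)

classes_in_task = torch.tensor([0, 1, 2])

# Keep only test nodes whose label belongs to the current task.
task_mask = test_mask & torch.isin(y, classes_in_task)

# Score only the task's classes so that predictions stay within them.
pred = out[task_mask][:, classes_in_task].argmax(dim=-1)
pred = classes_in_task[pred]  # map column positions back to class ids
acc = (pred == y[task_mask]).float().mean().item()
```

Because the whole graph is passed through the model once, no loader is involved; only the rows of the output used for the metric change.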

@vincentvic

Hi!
I'm trying to train an unsupervised GraphSAGE model on a heterogeneous graph, and I am having difficulty computing the loss when using NeighborLoader. Thank you for the help.

@rusty1s
Member

rusty1s commented Mar 4, 2022

Can you share some more details (at best in a new issue)? I'm happy to help!

@kayzliu

kayzliu commented Apr 15, 2022

@rusty1s how can I know which nodes are the center nodes in the sampled data obtained by torch_geometric.loader.NeighborLoader?

@rusty1s
Member

rusty1s commented Apr 16, 2022

The center nodes will be placed first in the sampled output. As such, they are given via slicing based on the batch size (the number of center nodes), that is batch.x[:batch_size]. We also apply this slicing in the linked example. Hope this clarifies your doubts!
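To illustrate the "center nodes first" layout, here is a toy sampler in plain Python. It is not PyG's implementation; the function `sample_subgraph` and the adjacency dictionary are hypothetical and only show why slicing the first `batch_size` entries recovers the seed nodes, while the remaining entries are the neighbors sampled around them.

```python
import random

def sample_subgraph(adj, seeds, num_neighbors, rng):
    """Toy neighbor sampling: returns global node ids with seeds first,
    plus edges relabeled to local indices."""
    nodes = list(seeds)            # seed (center) nodes occupy the first slots
    local = {n: i for i, n in enumerate(nodes)}
    edges = []
    frontier = list(seeds)
    for k in num_neighbors:        # one entry per hop
        next_frontier = []
        for dst in frontier:
            nbrs = adj.get(dst, [])
            for src in rng.sample(nbrs, min(k, len(nbrs))):
                if src not in local:
                    # Newly discovered nodes are appended after the seeds.
                    local[src] = len(nodes)
                    nodes.append(src)
                    next_frontier.append(src)
                edges.append((local[src], local[dst]))
        frontier = next_frontier
    return nodes, edges

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2], 4: [2]}
seeds = [3, 4]
nodes, edges = sample_subgraph(adj, seeds, num_neighbors=[2, 2],
                               rng=random.Random(0))
batch_size = len(seeds)
center_nodes = nodes[:batch_size]    # the seeds, in their original order
sampled_neighbors = nodes[batch_size:]
```

In this sketch the loss would be computed on the first `batch_size` rows only, and the other rows exist solely to provide messages for them.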

@agosztolai

agosztolai commented May 12, 2022

Hello,

thanks for the awesome code!

I am trying to implement the negative sampling algorithm used here but using the new NeighborLoader class.

Is there already an implementation of this?

My understanding is that I need to replace

train_loader = NeighborSampler(data.edge_index, sizes=[10, 10], batch_size=256,
shuffle=True, num_nodes=data.num_nodes)

by

kwargs = {'batch_size': 256}
train_loader = NeighborLoader(data, num_neighbors=[10, 10], shuffle=True, **kwargs)

Then, when I run

batch = next(iter(train_loader))

the output "batch" has the same structure as the input "data", except that edge_index is smaller and there is now a "batch_size" attribute too. It appears that data.x has been permuted and data.edge_index has been subsampled.

Are the 256 sampled nodes given by batch.x[:batch_size] (as per your previous answer)?
If so, what are the other nodes batch.x[batch_size:]?
In particular, where are the positive and negative samples?
Also, what is data.edge_index? Is it the induced subgraph of the center nodes, or does it also include the positive/negative samples?

@rusty1s
Member

rusty1s commented May 14, 2022

There exists a LinkNeighborLoader class that also supports negative sampling via neg_sampling_ratio.
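As a conceptual illustration of what a negative sampling ratio means, the plain-Python sketch below draws a fixed number of non-edges per positive edge and labels them 0. It does not reproduce how LinkNeighborLoader samples internally; the helper `sample_negative_edges` is hypothetical, and the choice to reject self-loops and existing edges is an assumption of this sketch.

```python
import random

def sample_negative_edges(pos_edges, num_nodes, neg_sampling_ratio, rng):
    """Draw round(ratio * len(pos_edges)) node pairs that are not positive edges."""
    existing = set(pos_edges)
    num_neg = round(neg_sampling_ratio * len(pos_edges))
    negatives = []
    while len(negatives) < num_neg:
        pair = (rng.randrange(num_nodes), rng.randrange(num_nodes))
        # Assumption: skip self-loops and pairs that are known positives.
        if pair[0] != pair[1] and pair not in existing:
            negatives.append(pair)
    return negatives

pos_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
neg_edges = sample_negative_edges(pos_edges, num_nodes=6,
                                  neg_sampling_ratio=2.0,
                                  rng=random.Random(0))

# Positives are labeled 1 and sampled negatives 0.
edge_label_index = pos_edges + neg_edges
edge_label = [1] * len(pos_edges) + [0] * len(neg_edges)
```

With a ratio of 2.0 and four positive edges, the supervision set holds eight negatives, so a binary link-prediction loss sees twice as many negative as positive pairs.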

@v01cano

v01cano commented Sep 1, 2023

Question: how should NeighborSampler(data, size=[1.0, 1.0], num_hops=2, batch_size=batch_size, shuffle=True, add_self_loops=True) be expressed using NeighborLoader?
