Neighborhood Sampling #64

tbright17 · 2018-12-27T00:08:37Z

Hi Matthias,

I wrote my own dataset and dataloader and I used adjacent matrix instead of edge_index. When I tried to convert adj_matrix to edge_index, I got confused because I have multiple graphs (multiple samples, may have different number of nodes) in one batch. I went over some of the examples and found most of them have batch_size 1. How should I prepare the edge_index in mini-batch setting? I can easily use DenseSAGEConv but I want to try other networks.

Thanks,
Ming

rusty1s · 2018-12-27T11:48:56Z

Hi,

If I understand you correctly, you are saving and loading dense adjacency matrices and want to convert them to a sparse layout. In general, this approach is not recommended, because of the huge memory overload when stacking dense adjacency matrices block-wise. If you want to convert your dense matrices to sparse matrices, you can make use of the nonzero method from PyTorch.

edge_index = adj.nonzero().t().contiguous()

tbright17 · 2018-12-27T17:47:14Z

Thanks for the reply. I also want to do mini-batch training so each batch has several graphs. Can I achieve that?

Thanks

rusty1s · 2018-12-27T17:48:40Z

This can be automatically achieved when using the torch_geometric.data.DataLoader.

tbright17 · 2018-12-27T17:49:46Z

Thanks. I will check this.

zeneofa · 2018-12-27T21:19:36Z

Hi,

Thanks again for the awesome package. I have a few question related to this, so I did not want to create separate issue:

From the examples it appears that mini-batching is done by graph only? Is it possible to do this by node, or edge? (I only have one large graph)
For SplineConv, is the entire edge_index, edge_attr and node_set provided for each forward pass?

Background: I have a large dense graph with ~260000 nodes, with approximately 8 node features. From which I am trying to train a node classification model (binary classes), with the help of a 3D manifold (this forms my pseudo-path/co-ordinates). Edges are independent of the manifold itself, though their attributes are derived from it.

rusty1s · 2018-12-27T21:40:24Z

Hi,
Yes, this is a very important subject and something that is currently not supported. This would probably be done by a new dataloader which samples nodes from the graph and builds a k-hop subgraph around each node (where k corresponds to the number of layers). I will try to add this to the package.
There are already a bunch of papers on this topic, so if you need something specific, please point me to the respective papers.

zeneofa · 2018-12-28T18:22:52Z

Hi,

I have looked into the L-hop strategy, but with the density of my networks, this will effectively be the entire network with only a few layers.

There is the paper (https://arxiv.org/pdf/1710.10568.pdf), with tensorflow code (https://github.com/thu-ml/stochastic_gcn), that looks promising. The code is not documented, with only a few comments, which makes it quite hard to parse (that and I have not used tensorflow at all). The idea however seems quite straight forward, I am not sure how to integrate this with splineconv though.

duncster94 · 2019-04-18T14:58:47Z

Might be worth looking into this as well:
https://arxiv.org/pdf/1801.10247.pdf

They report significant speedups over the Kipf Welling GCN and GraphSAGE with comparable performance.

familyld · 2019-04-30T07:54:32Z

Expecting to see the mini-batching support for a single big graph.

fkhawar · 2019-05-22T23:19:36Z

Hi, any progress on this or any work around in the meanwhile?

rusty1s · 2019-05-23T03:23:04Z

Been working on it :)

rusty1s · 2019-05-31T09:54:42Z

I added a first version of NeighborSampler to PyTorch Geometric. It is still undocumented and unfinished (e.g. there is currently no support for num_workers and node probabilities). You can find a training example on the giant Reddit graph in the examples/ directory here. PyTorch Cluster needs to be upgraded to v1.4.0 in order to use.

I would be very happy to discuss the API here and get feedback from you. Currently, you can iterate the loader and access a batch-wise DataFlow object, which defines a computation flow up to num_hops+1 layers. You can print it to see how many nodes are being accessed in each layer, e.g.:

DataFlow(1000<-20000<-60000)

Each block in DataFlow defines a bipartite graph between intermediate layers. Starting from the root nodes, you can propagate messages to the final nodes.

duncster94 · 2019-05-31T21:22:43Z

@rusty1s Thanks for the update. I'm having trouble getting torch-cluster v1.4.0 installed. v1.2.1 installation gives me no problems.

Here's the error:

cpu/graclus.cpp: In lambda function:
    cpu/graclus.cpp:52:43: error: invalid initialization of reference of type ‘const at::Type&’
from expression of type ‘c10::ScalarType’
       AT_DISPATCH_ALL_TYPES(weight.scalar_type(), "weighted_graclus", [&] {
                                               ^
    /scratch/gobi1/forsterd/pytorchEnv/lib/python3.7/site-packages/torch/lib/include/ATen/Dispatch.h:70:32: note: in definition of macro ‘AT_DISPATCH_ALL_TYPES’
         const at::Type& the_type = TYPE;                                         \
                                    ^
    error: command 'gcc' failed with exit status 1

Any ideas?

rusty1s · 2019-06-01T03:13:40Z

Yes, you need to update to PyTorch 1.1. Sorry :(

duncster94 · 2019-06-01T04:12:48Z

@rusty1s That solved it, thanks!

mainak124 · 2019-06-25T21:34:17Z

@rusty1s Thanks very much for adding the NeighborSampler. If I understood correctly, it supports node iterator which is helpful for supervised task. For unsupervised task, however, I guess having an edge iterator would be necessary. For now, I am just trying to reproduce the result (unsupervised setting) from GraphSAGE paper. Is there any plan of adding that support in future? Thanks a lot!

rusty1s · 2019-06-26T06:13:02Z

Can you elaborate a bit more? I am not sure if I fully understand. For GraphSAGE unsupervised learning, you need negative sampling and a sampling scheme for near neighbors. Negative sampling is trivial via the use of another randomly shuffling NeighborSampler. For sampling of neighbors, you need to sample node indices from a random walk window and input those to the NeighborSampler for producing a corresponding DataFlow object, .e.g.:

sampler = NeighborSampler(..., shuffle=True)
negative_sampler = NeighborSampler(..., shuffle=True)

for data_flow, negative_data_flow in zip(sampler, negative_sampler):
    n_id = data_flow.n_id  # Get node indices
    rw_sampled_n_id = ...  # Sample node indices from random walks starting from `n_id`   
    # Get another `data_flow` object to the sampled node indices 
    neighboring_data_flow = sampler.__produce__(rw_sampled_n_id)

    # Compute embeddings for each `data_flow` object
    z = model(data_flow)
    negative_z = model(negative_data_flow)
    neighboring_z = model(neighboring_data_flow)
    ...

    # Compute loss
    ...

For random walk sampling, we provide a GPU only and still undocumented functionality in torch-cluster. WDYT?

mainak124 · 2019-06-28T18:48:22Z

@rusty1s I was thinking of iterating through all the edges in the graph instead of iterating through all the nodes in one epoch. But your approach is much cleaner! Thank you so much! :)

duncster94 · 2019-07-03T22:45:11Z

Hey @rusty1s quick question: does data flow computation necessarily happen on the CPU? I.e. is it always necessary to send the data flow object to the device you're using after it's computed on the CPU?

rusty1s · 2019-07-04T04:45:49Z

Yes, I am not sure if GPUs could bring any speed ups here. In the end it would require the whole graph on the GPU, something we want to prevent with this approach.

JensTheDude · 2019-07-11T08:26:04Z

@rusty1s Thank you for the graphsage implementation. Are there any plans to implement a sampler for FastGCN as well?

rusty1s · 2019-07-11T08:27:41Z

Yes :)

ankitjain451 · 2019-08-02T18:02:09Z

In unsupervised GraphSAGE, the original code iterate over edges instead of nodes and compute the reconstruction loss on sampled edges. Do you have plans to support that? Not sure how iterating over nodes will help in unsupervised case?

rusty1s · 2019-08-03T08:49:59Z

I am not sure I fully understand why you can not implement the unsupervised GraphSAGE loss with the current neighbor sampling API, see #64 (comment). Can you elaborate?

ankitjain451 · 2019-08-05T22:36:28Z

I looked into it and yes you can. I have two points:

In your code above, it is not really clear to me how would you do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?
Also, can you please guide me more on how to structure the loss and validation for unsupervised learning. I want to use reconstruction loss (reconstructing edges based on embeddings). Currently in Tensorflow, I am batching by edges which helps me remove few edges from the training dataset and use max margin loss. Not sure if that provision is built in your code right now?

I am fairly new to Pytorch so don't mind if these simple questions.

rusty1s · 2019-08-06T05:27:26Z

In your code above, it is not really clear to me how would you do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?

You certainly could use more sophisticated negative sampling strategies. A basic strategy to achieve this would be to just resample nodes which already occur in the sampled neighborhood. In practice, this is negligible IMO.

Also, can you please guide me more on how to structure the loss and validation for unsupervised learning. I want to use reconstruction loss (reconstructing edges based on embeddings). Currently in Tensorflow, I am batching by edges which helps me remove few edges from the training dataset and use max margin loss. Not sure if that provision is built in your code right now?

Batching by edges is also a good idea to implement the GraphSAGE loss. Given that you have sampled positive edges and sampled negative edges, you can compute the source and target node embeddings using the NeighborSampler:

pos_edge_index_batch = ...
neg_edge_index_batch = ...

sampler = NeighborSampler(..., shuffle=False)

for z_u_data_flow, z_v_data_flow, z_vn_data_flow in zip(sampler(pos_edge_index_batch[0],
                                                        sampler(pos_edge_index_batch[1],
                                                        sampler(neg_edge_index_batch[1]): 
    z_u = model(z_u_data_flow)
    z_v = model(z_v_data_flow)
    z_vn = model(z_vn_data_flow)
    # Compute loss based on Eq. (1) of https://arxiv.org/pdf/1706.02216.pdf

ankitjain451 · 2019-08-06T15:41:42Z

Thanks. I think that makes sense. Just trying to get my understanding around your library to implement edge based batching. Can you guide me where and how do I implement this while maintaining your code design. I can submit a pull request with that eventually.

rusty1s · 2019-08-06T16:05:50Z

I think a simple example on how to implement the GraphSAGE unsupervised loss should be sufficient. Feel free to submit a PR :)

wwliu555 · 2019-08-19T19:24:41Z

@rusty1s Thanks for the update! I ran into the Segmentation faultwhen running the example reddit.py

Fatal Python error: Segmentation fault

Current thread 0x00007fe21326f740 (most recent call first):
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_cluster/sampler.py", line 13 in neighbor_sampler
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 154 in __produce__
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 199 in __call__
  File "reddit.py", line 68 in train
  File "reddit.py", line 89 in <module>
Segmentation fault

Any idea how I can fix it? Thanks in advance!

rusty1s · 2019-08-19T19:34:08Z

Can you run the torch-cluster test suite to verify that everything works as expected?

wwliu555 · 2019-08-20T00:40:02Z

The test outputs said that my compiler (g++ 4.8.5) was incompatible. I upgraded it and it works fine now. Thanks:)

wwliu555 · 2019-08-21T18:36:35Z

Heyyy @rusty1s, I'm trying to make NNConv supported for NeighborSampler.

One quick question: how can I obtain edge_attr in each block? Should I add self-loops to the graph first and then use e_id to index them?

rusty1s · 2019-08-21T20:24:40Z

Exactly, self-loops need to exist in advance in order to use e_id. In addition, you need to set add_self_loops=True so that self-loops are guaranteed to be sampled.

raphaelsulzer · 2020-02-20T10:00:34Z

Is there any smart way to do inference with the neighborhood sampler?
I am using it for training on large graphs, but for inference I run out of memory very quickly.

I am thinking of something like sampling n (overlapping) regions from the graph and making sure that every node is sampled at least once. Is that currently possible?

Thanks!

duncster94 · 2020-02-20T15:00:23Z

@raphaelsulzer If you can fit all the neighbours of any single node in memory you can try this:

ns = NeighborSampler(data,
    size=1.0,
    num_hops=num_hops,
    batch_size=1,
    shuffle=False,
    add_self_loops=True)

This will sample each node and its num_hops neighbourhood so you can do a forward pass on each node one-by-one.

raphaelsulzer · 2020-02-26T00:31:35Z

@duncster94 Thank you! That works!

* clean heteroconv * init * init * clean up Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>

* added HGT DBLP example * typo * Merge PyG master (#52) * Adding the Facebok Page-Page dataset * type hints * documentation CI * py 3.8 * fix links * fix links * fail on warning * fail on warning * fix doc Co-authored-by: benedekrozemberczki <benedek.rozemberczki@gmail.com> * revert * Fix Documentation Rendering (#51) * fix doc rendering * fix linting * retrigger checks * remove pytorch 1.7.0 legacy code (#50) * Fix `copy.deepcopy` within lazy `nn.dense.Linear` (#44) * fix deepcopy within lazy Linear * fix merge * assert exception * example to doc * resolve conflict * resolve conflict * Add Figure and Equation to `to_hetero` docstring (#60) * add tex * add svg + docstring * typo * added equation * Message Passing Hooks (#53) * add hooks * docstring * add docstring * allow modification of inputs/output * add test for modifying output * add additional asserts for modifying output test * Rename `HeteroData.get_edges` and `HeteroData.get_nodes` (#58) * rename to_edges and to_nodes * typo * `HeteroConv` (#64) * clean heteroconv * init * init * clean up Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * fix documentation * bipartite function * fix test CI * remove pillow version * clean up for merge * Merge PyG master (#69) * renaming: PointConv to PointNetConv * Fix a broken link in datasets/gdelt.py (#2800) * fix test * re-add batching of strings * add quick start table * gnn cheatsheet * remove pillow version Co-authored-by: Dongkwan Kim <todoaskit@gmail.com> * re-merge * add lazy column to GNN cheatsheet (#70) * `to_hetero_with_bases(model)` (#63) * update * fix linting * basisconv * add ValueError * to_hetero_with_bases impl done * add test * add comments * add comments * docstring * typo * update figure * svg * typo * add test * update * add rgcn equality test * typos * update * typos * update figures * generate new svgs * fix assignment * rename * delete sorted edge types * rename * add legend * fix typo * Test: Check equal outputs of `to_hetero` and `RGCNConv` (#59) * check equal output * add sparsetensor test * check equal output * add sparsetensor test * rename * linting * add missing import * `HeteroData` support for `T.NormalizeFeatures` (#56) * normalize features * allow normalization of any feature * in-place div * normalize features * allow normalization of any feature * in-place div * fix test * no need to re-assign * `HeteroData` support for `T.AddSelfLoops` (#54) * hetero support for AddSelfLoops * check for edge_index attribute * f-string * retrigger checks * revert bipartite changes * hetero support for AddSelfLoops * check for edge_index attribute * f-string * retrigger checks * revert bipartite changes * merge master * merge master * `HeteroData` support for `T.ToSparseTensor` (#55) * hetero support for ToSparseTensor * add test * customize the attribute of SparseTensor.value * rework sort_edge_index * hetero support for ToSparseTensor * add test * customize the attribute of SparseTensor.value * rework sort_edge_index * linting * `HeteroData` support for `T.ToUndirected` (#57) * to_undirected * revert bipartite changes * coalesce + undirected enhancement * merge master * revert bipartite changes * coalesce + undirected enhancement * merge master * clean up * new default relation type * fix tests * resolve merge conflicts * resolve merge conflicts 2 * resolve merge conflicts 3 * Merge PyG master (#74) * renaming: PointConv to PointNetConv * Fix a broken link in datasets/gdelt.py (#2800) * fix test * re-add batching of strings * add quick start table * gnn cheatsheet * remove pillow version * clean up doc for to_dense_batch * clean up * add legend to cheatsheet * Improve terminology (#2837) I think the previous version of the document uses the term 'symmetric' incorrectly. A symmetric matrix is a square matrix that is is equal to its transpose (https://en.wikipedia.org/wiki/Symmetric_matrix). However, the text is only talking about the shape of the matrix, not its content. Hence, 'square (matrix)' would be the correct term to use. * Add batch_size input to to_dense_batch (#2838) * Add batch_size input to to_dense_batch * to_dense_batch fix typo in batch_size param use * add typehints Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * typo * Added return_attention_weights to TransformerConv. (#2807) * added return_weights functionality to tranformer * added return attn weights tests * flake8 * added typehints Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * MD17 (#2843) * Added MD17 dataset * Updated Documentation * Added link to sGDML website in doc * fixed typos in doc and made train variable description clearer * clean up * fix linting * fix doc warning Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * update doc * remove forward doc * add static graph support info to cheatsheet * fix num_nodes in case edge_index is empty * fix math formula * faster GDC import * lazy import * lazy import for datasets * lazy import for nn * Sequential jittable + traceable * typo * typo * update doc Co-authored-by: Dongkwan Kim <todoaskit@gmail.com> Co-authored-by: Markus <markus.zopf@outlook.com> Co-authored-by: Jimmie <jimmiebtlr@gmail.com> Co-authored-by: Jinu Sunil <jinu.sunil@gmail.com> Co-authored-by: Moritz R Schäfer <moritz.schaefer@protonmail.com> * re-add * GraphGym cleaned version (#82) * GraphGym cleaned version * remove deepsnap dependency * fix lint errors, part 1 * fix all lint errors * fix all lint errors * fix all lint errors * apply yapf * Update .gitignore * Integrate GraphGym into PyG (#85) * GraphGym cleaned version * remove deepsnap dependency * fix lint errors, part 1 * fix all lint errors * fix all lint errors * fix all lint errors * apply yapf * Integrate graphgym into pyg, keep user API in project root * fix merge conflict * fix lint errors * Make optional dependencies * merge LICENSE from GraphGym * add import * clean up LICENSE * fix import * resolve merge conflicts * resolve merge conflicts 2 * Merge PyG master (#87) * renaming: PointConv to PointNetConv * Fix a broken link in datasets/gdelt.py (#2800) * fix test * re-add batching of strings * add quick start table * gnn cheatsheet * remove pillow version * clean up doc for to_dense_batch * clean up * add legend to cheatsheet * Improve terminology (#2837) I think the previous version of the document uses the term 'symmetric' incorrectly. A symmetric matrix is a square matrix that is is equal to its transpose (https://en.wikipedia.org/wiki/Symmetric_matrix). However, the text is only talking about the shape of the matrix, not its content. Hence, 'square (matrix)' would be the correct term to use. * Add batch_size input to to_dense_batch (#2838) * Add batch_size input to to_dense_batch * to_dense_batch fix typo in batch_size param use * add typehints Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * typo * Added return_attention_weights to TransformerConv. (#2807) * added return_weights functionality to tranformer * added return attn weights tests * flake8 * added typehints Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * MD17 (#2843) * Added MD17 dataset * Updated Documentation * Added link to sGDML website in doc * fixed typos in doc and made train variable description clearer * clean up * fix linting * fix doc warning Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * update doc * remove forward doc * add static graph support info to cheatsheet * fix num_nodes in case edge_index is empty * fix math formula * faster GDC import * lazy import * lazy import for datasets * lazy import for nn * Sequential jittable + traceable * typo * typo * update doc * Simple models (#2869) * Inclusion of new backbone models * Eliminating head from asap.py * small correction * Create test_gcn.py * Update __init__.py * Update test_gcn.py * Left only the convolutional simple models * Tests included * update * clean up * clean up v2 * fix activation Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * Example for MemPooling. (#2729) * example for mem pooling * backprop on kl loss is done at the end of an epoch. Keys in memory layers are trained only on kl loss. * added learning rate decay. Using PROTIENS_full * flake8 * reduced lr. increased weight decay * changed download location * added comments * clean up Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * typos * fix removeisolatednodes transform in case 'data.num_nodes' is present * fix XConv with dilation > 1 * fix XConv with dilation > 1 * rgcn link prediction (#2734) * implemented LinkPrediction dataset for loading FB15k237 * implemented evaluation for relational link prediction * implemented R-GCNConf link prediction example * fixed bug: wrong initial objects in negative_sampling * changed file downloader urllib.request.urlretrieve to pytorch.data.download_url; renamed LinkPrediction class to RelationalLinkPredictionDataset * update dataset * update example script * rename Co-authored-by: Moritz <moritzblum> Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * fix gnnexplainer draw kwargs * remove python-louvain dependency * allow customization of output in MP jit mode * fix test for py3.6 * changed normalisation to same norm from instance norm to be robust to small var (#2917) * add CITATION.cff * format * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * [ci skip] * add basetransform ABC (#2924) * clean up BaseTransform * clean up GATConv and add comments * add max_num_neighbors as an additional argument * fix jit GATConv on PyTorch 1.8.0 * fix doc * fix gnn explainer with existing self-loops * Rgcn link pred fix (#2946) * added regularization, removed typo in test * clean up Co-authored-by: Moritz <moritzblum> Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * typo * Correct gini coefficient mathcal formula (#2932) * typo * typo * Update from_networkx (#2923) * Update from_networkx * Update test * Update convert.py * Minor corrections * Update test_convert.py * Corrections * Update test_convert.py * Case where there are no edges * Correcting how edge_attr are concatenated * clean up + new test * remove unused code * add union type Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * fix deterministic ordering in from_networkx * recursive-include *.jinja files Co-authored-by: Dongkwan Kim <todoaskit@gmail.com> Co-authored-by: Markus <markus.zopf@outlook.com> Co-authored-by: Jimmie <jimmiebtlr@gmail.com> Co-authored-by: Jinu Sunil <jinu.sunil@gmail.com> Co-authored-by: Moritz R Schäfer <moritz.schaefer@protonmail.com> Co-authored-by: PabloAMC <pmorenocf@alumnos.unex.es> Co-authored-by: Moritz Blum <31183934+moritzblum@users.noreply.github.com> Co-authored-by: fbragman <fbragman@users.noreply.github.com> Co-authored-by: Christopher Lee <2824685+CCInc@users.noreply.github.com> Co-authored-by: Tim Daubenschütz <tim@daubenschuetz.de> * resolve merge conflicts 3 * resolve merge conflicts 4 * Implementation of the `HGTLoader` + `ogbn-mag` example (#73) * first try * update * HGT Loader * typo * first try * update * HGT Loader * typo * bugfixes * lazy GATConv * bugfix * bugfix * full working pipeline * update * rename * docstring * typos * update * typo * typo * typo * added comments * add test * add tests * fix example * rename * linting * Random split functionalities (#72) * link split * create split * example tests * link split tests * fix linting * update docstring * undirected option, refactor and docs * add num nodes as argument to neg sampling * clean up + remove single object * update example * typo * fix compose Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * add basetransform * typo * typo * fix test * Improve `torch_geometric.data` Documentation (#98) * update data doc * typo * typo * note * typo * add docstring * only show inherited members for data and hetero_data * documentation update for batch and dataset * update doc * update * fix * record_stream * update * typo * add/fix data functionality * linting * typo * `_parent` memory leak fix (#103) * memory leak fix * Clean up * clean up * bugfix tests * typos * fix test * fix test * rename reverse * (Heterogeneous) `NeighborLoader` (#92) * initial commit * typo * neighbor loader functionality + tests * docstring * fix docstring * skip tests * fix share_memory_ * typo * typo * update example * typo * share_strategy * fix cuda calls * better print * fix size * fix print * final commit * fix * some todos * preprocessed features * fix to_undirected * more documentation * update doc * fix doc * fix doc * Add benchmark code and the example with existing graph classification examples (#93) * add benchmarking utilities * update graph classification benchmark * improve code style * add pytorch-memlab for benchmark code * skip some tests when cuda is not available * add type hint when appropriate * add seed_everything to improve code * code refactoring * code refactoring * code refactoring * code improvement * remove unnecessary dataloader import * change benchmark interface with decorator * documentation improvement * linting * linting part 2 * linting part 3 * seed_everything * create utils file * update * use utils functions * fix test * update the profiler to the latest torch (1.8.1+) * refactor profiler and add more documentation * refactor profiler and add more documentation * resolve lint errors * resolve lint errors * update * clean up test and profile * fix linting * add to doc * fix doc * typo * update benchmark Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * Move `HGTLoader` to `torch_geometric.loader` + clean up (#104) * move files * use utils functions * fix example * update * fix tests * fix seed * fix linear test * rename * Support GraphGym custom modules outside PyG package (#102) * GraphGym cleaned version * remove deepsnap dependency * fix lint errors, part 1 * fix all lint errors * fix all lint errors * fix all lint errors * apply yapf * Integrate graphgym into pyg, keep user API in project root * fix merge conflict * fix lint errors * Make optional dependencies * merge LICENSE from GraphGym * Enable adding GraphGym customized modules outside PyG package * lint * Rename `AddTrainValTestMask` to `RandomNodeSplit` (#108) * initial commit * rename example * remove AddTrainValTestMask * fix linting * create optimizer config and scheduler config separately (#113) * create optimizer config and scheduler config separately * fix format * import explicitly Co-authored-by: Dong Wang <dongwang@yannis-air.lan> * Heterogeneous Graph Tutorial (#83) * add HG tutorial roadmap * started working on hg tutorial * hg_tutorial, some text and .tex figure * added svg * hg tutorial content * fix CI * text and structure * finished first draft * fixed one code example * fixing conventions * fixing links * update svg * some smaller improvements of tutorial * improvements on tutorial * hg-tutorial: fixed compiling issue, added detailed content * added absolute links * fixed warnings * streamlined dataset section * update svg * update tutorial * update 2 Co-authored-by: Jan Eric Lenssen <janeric.lenssen@tu-dortmund.de> * typo * Move data loaders to `torch_geometric.loader` (#110) * move graphsaint * deprecations * move clusterloader * deprecations * type hints * move shadow * typo * typo * move datalistloader * dense data loader * random node sampler * fix doc * Lazy GNN operators (#89) * lazy cheb conv * lazy GraphConv * lazy GATv2Conv * lazy TAGConv * lazy FAConv * lazy FeaStConv * lazy NNConv * typo * fix tests * lazy SuperGATConv * lazy SuperGATConv fix * lazy SplineConv * fix lazy check * lazy GravNetConv * arma conv lazy * dense linear in gmmconv * typo * add test * lazy GMMConv * doc * rename (#116) * Revisit `MetaPath2Vec` (#114) * revisit metapath2vec * update * typo * update * fix doc * update * check for attributes rather than key * Clean up `torch_geometric.profile` further (#111) * remove print_layer_stats * typos * update * readme highlights and quick tour (#99) * readme highlights and quick tour * arch * arch image * arch overview * list categories * categorization * category description * Update README.md from Matthias Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * improved highlights * Update README.md Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * Update README.md Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * Update README.md Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * Update README.md Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * minor * update readme * update * update * update * update * fix url * update * update * update * update * update * update * move ops * toc * typo * typo * add svgs * update figure * fix links * fix size * fix size * typo Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * fix broken links * fix links * Heterogeneous Graph Sampler Tutorial (#117) * initial commit * address comments * remove todo * typo * Conversion between heterogenous and homogeneous graph objects (#115) * temp checkpoint (wip, will remove) * (wip) typed graph conversion * (wip) typed graph conversion * (wip) typed graph conversion * update * typo * delete examples Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de> * fix test * update doc * deprecate NeighborSampler (#119) * Move `torch_geometric.data.DataLoader` to `torch_geometric.loader.DataLoader` (#120) * move dataloader * rename * typos * typos * fix __cat_dim__ * updategp * Deprecate `train_test_split_edges` + Modifications to `RandomLinkSplit` (#121) * deprecate train_test_split_edges * to device transform * fix example * add split_labels argument * fix autoencoder example * typos * add docstring * ARGVA * seal * adress comments * Create example to load `*.csv` and transfer to `HeteroData` (#76) * create example to load csv file and transfer to heter-data * add ipython notebook version load csv with documentation * address comment * first version of csv loading doc * first version of csv loading doc * suggestion docs/source/notes/loading_csv.rst Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * suggestion docs/source/notes/loading_csv.rst Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * suggestion docs/source/notes/loading_csv.rst Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * suggestion docs/source/notes/loading_csv.rst Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * suggestions csv tutorial * example script load csv + extract fix * fixed edge index stacking dimension in example and jupyter nb * linting * linting2 * rename * update * update * update * typo * typo * update * rename * update tutorial * typo * address comments Co-authored-by: Dong Wang <dongwang@yannis-air.lan> Co-authored-by: Jan Eric Lenssen <janeric.lenssen@tu-dortmund.de> Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de> * typo * fix * typo * update * fix * fix Co-authored-by: benedekrozemberczki <benedek.rozemberczki@gmail.com> Co-authored-by: Rex Ying <rexying@stanford.edu> Co-authored-by: Dongkwan Kim <todoaskit@gmail.com> Co-authored-by: Markus <markus.zopf@outlook.com> Co-authored-by: Jimmie <jimmiebtlr@gmail.com> Co-authored-by: Jinu Sunil <jinu.sunil@gmail.com> Co-authored-by: Moritz R Schäfer <moritz.schaefer@protonmail.com> Co-authored-by: Jiaxuan <youjiaxuan@gmail.com> Co-authored-by: PabloAMC <pmorenocf@alumnos.unex.es> Co-authored-by: Moritz Blum <31183934+moritzblum@users.noreply.github.com> Co-authored-by: fbragman <fbragman@users.noreply.github.com> Co-authored-by: Christopher Lee <2824685+CCInc@users.noreply.github.com> Co-authored-by: Tim Daubenschütz <tim@daubenschuetz.de> Co-authored-by: Yue Zhao <yzhao062@gmail.com> Co-authored-by: Dong Wang <dongw89@gmail.com> Co-authored-by: Dong Wang <dongwang@yannis-air.lan> Co-authored-by: Jan Eric Lenssen <janeric.lenssen@tu-dortmund.de>

codexhammer · 2022-01-12T13:26:55Z

Hi @rusty1s,

I am trying to train and classify only a fraction of the total number of classes in a continual learning setting. For example, for Cora dataset with 7 classes:

path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Cora')
data = Planetoid(path, "Cora", transform=transforms.NormalizeFeatures())[0]
_, _, classes, train_mask, val_mask, test_mask = data
classes_in_task = [0,1,2]

conditions = torch.BoolTensor( [l in classes_in_task for l in classes[1]] )
train_mask = (train_mask[0], train_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in **train** dataset
val_mask = (val_mask[0], val_mask[1] * conditions) # Mask all nodes with classes not in [0,1,2] in **test** dataset
test_mask = (test_mask[0], test_mask[1] * conditions) # Mask all nodes with classes not in [0,1,2] in **validation** dataset

train_task_nid = np.nonzero(train_mask[1])  # Select node_ids with classes in [0,1,2]
train_task_nid = torch.flatten(train_task_nid)

train_loader = NeighborLoader(
        data,
        num_neighbors=[30]*2,
        batch_size=2000,
        input_nodes=train_task_nid,
        )

I use the train_loader to train the nodes with only specific classes in multiple batches. But, training the network this way for only specific nodes isn't giving me good results. Is there any workaround for this?

Also, during the testing, I want to test all the test nodes as a single batch. How to set this in Neighborloader arguments?

rusty1s · 2022-01-12T14:14:14Z

What do you mean with "it does not give good results"? Is it due to your learning setup or due to NeighborLoader?

During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.

codexhammer · 2022-01-12T15:53:05Z

During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.

If I omit the usage of NeighborLoader, how can I select the input nodes with only the classes [0,1,2] for testing? Is there a function for only selecting a particular set of nodes? Thanks for the reply.

rusty1s · 2022-01-12T17:15:27Z

After obtaining the model output, you apply masking only the node embeddings you are interested in:

out = model(data.x, data.edge_index)
out_train = out[data.train_mask]

vincentvic · 2022-03-04T15:06:41Z

Hi !
I'am trying to train an unsupervised GraphSAGE model on an heterogeneous graph and I have some difficulties to compute the loss with the use of NeighborLoader. Thank you fro the help

rusty1s · 2022-03-04T21:30:39Z

Can you share some more details (at best in a new issue)? I'm happy to help!

kayzliu · 2022-04-15T19:22:22Z

@rusty1s how can I know which nodes are the center nodes in the sampled data obtained by torch_geometric.loader.NeighborLoader.

rusty1s · 2022-04-16T03:23:01Z

The center nodes will be placed first in the sampled output. As such, they are given via slicing based on the batch size (the number of center nodes), that is batch.x[:batch_size]. We also apply this slicing in the linked example. Hope this clarifies your doubts!

agosztolai · 2022-05-12T15:40:16Z

Hello,

thanks for the awesome code!

I am trying to implement the negative sampling algorithm used here but using the new NeighborLoader class.

Is there already an implementation of this?

My understanding is that I need to replace

train_loader = NeighborSampler(data.edge_index, sizes=[10, 10], batch_size=256,
shuffle=True, num_nodes=data.num_nodes)

by

kwargs = {'batch_size': 256}
train_loader = NeighborLoader(data,num_neighbors=[10,10],shuffle=True, **kwargs)

Then, when I run

batch = next(iter(train_loader))

the output 'batch' looks exactly the same shape as the input "data", except for the edge_index is smaller and now I have a "batch_size" attribute too. It appears that 'data.x' has been permuted, and data.edge_index has been subsampled.

Are the 256 sampled nodes given by batch.x[:batch_size] (as per your previous answer)?
If so, what are the other nodes batch.x[batch_size:]?
In particular, where are the positive and negative samples?
Also, what is data.edge_index? - is it the induced subgraph of the center nodes or does it also include the positvie/negative samples?

rusty1s · 2022-05-14T23:10:22Z

There exists a LinkNeighborLoader class that also supports negative sampling via neg_sampling_ratio.

v01cano · 2023-09-01T08:35:39Z

请问：NeighborSampler(data, size=[1.0, 1.0], num_hops=2, batch_size=batch_size, shuffle=True, add_self_loops=True)
转换为NeighborLoader函数应该如何表达吗？

rusty1s added the feature label Mar 11, 2019

rusty1s changed the title ~~how to batch edge_index~~ Neighborhood Sampling Mar 13, 2019

rusty1s mentioned this issue Mar 20, 2019

Benchmark scripts for training time performance #139

Closed

rusty1s mentioned this issue Apr 17, 2019

Giant graph performance problem #212

Closed

rusty1s mentioned this issue May 2, 2019

Any other sampling strategies like GraphSAGE, FastGCN, AdaptiveSampling? #270

Closed

rusty1s mentioned this issue May 16, 2019

About a lot of edges #306

Closed

dcoukos mentioned this issue Feb 24, 2020

Using NeighborSampler (with InMemoryDatasets) #987

Closed

rusty1s added a commit that referenced this issue Sep 2, 2021

HeteroConv (#64)

b08ec35

* clean heteroconv * init * init * clean up Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>

gebrahimi91 mentioned this issue Jan 11, 2022

Segmentation fault when use NeighborSampler. #3835

Closed

ivaylobah closed this as completed Oct 26, 2022

Neighborhood Sampling #64

Neighborhood Sampling #64

Comments

tbright17 commented Dec 27, 2018

rusty1s commented Dec 27, 2018

tbright17 commented Dec 27, 2018

rusty1s commented Dec 27, 2018

tbright17 commented Dec 27, 2018

zeneofa commented Dec 27, 2018

rusty1s commented Dec 27, 2018

zeneofa commented Dec 28, 2018 • edited Loading

duncster94 commented Apr 18, 2019

familyld commented Apr 30, 2019

fkhawar commented May 22, 2019

rusty1s commented May 23, 2019

rusty1s commented May 31, 2019

duncster94 commented May 31, 2019

rusty1s commented Jun 1, 2019

duncster94 commented Jun 1, 2019

mainak124 commented Jun 25, 2019

rusty1s commented Jun 26, 2019

mainak124 commented Jun 28, 2019

duncster94 commented Jul 3, 2019

rusty1s commented Jul 4, 2019

JensTheDude commented Jul 11, 2019

rusty1s commented Jul 11, 2019

ankitjain451 commented Aug 2, 2019

rusty1s commented Aug 3, 2019

ankitjain451 commented Aug 5, 2019 • edited Loading

rusty1s commented Aug 6, 2019

ankitjain451 commented Aug 6, 2019

rusty1s commented Aug 6, 2019 • edited Loading

wwliu555 commented Aug 19, 2019

rusty1s commented Aug 19, 2019

wwliu555 commented Aug 20, 2019

wwliu555 commented Aug 21, 2019

rusty1s commented Aug 21, 2019 • edited Loading

raphaelsulzer commented Feb 20, 2020

duncster94 commented Feb 20, 2020

raphaelsulzer commented Feb 26, 2020

codexhammer commented Jan 12, 2022

rusty1s commented Jan 12, 2022

codexhammer commented Jan 12, 2022

rusty1s commented Jan 12, 2022

vincentvic commented Mar 4, 2022

rusty1s commented Mar 4, 2022

kayzliu commented Apr 15, 2022 • edited Loading

rusty1s commented Apr 16, 2022

agosztolai commented May 12, 2022 • edited Loading

rusty1s commented May 14, 2022

v01cano commented Sep 1, 2023

zeneofa commented Dec 28, 2018 •

edited

Loading

ankitjain451 commented Aug 5, 2019 •

edited

Loading

rusty1s commented Aug 6, 2019 •

edited

Loading

rusty1s commented Aug 21, 2019 •

edited

Loading

kayzliu commented Apr 15, 2022 •

edited

Loading

agosztolai commented May 12, 2022 •

edited

Loading