The SubG
package is an extension library based on C and OpenMP to accelerate subgraph operations for building structural features and subgraph-based graph representation learning (SGRL).
Follow the principles of algorithm system co-design, subgraph queries (e.g. ego-network in canonical SGRLs) of target links, motifs, and high-order patterns can be decomposed into node-level intermediate results (e.g. collection of walks by walk_sampler
in SUREL, set of nodes by gset_sampler
in SUREL+), whose joint can act as proxies of subgraphs, and be reused among different queries.
Currently, SubG
consists of the following methods for efficient and scalable implementation of SGRLs:
-
gset_sampler
node set sampling with structure encoder of landing probability (LP) -
walk_sampler
walk sampling with relative positional encoder (RPE) -
batch_sampler
query sampling (a group of nodes) for mini-batch training of link prediction -
walk_join
online joining of node-level walks to construct the proxy of subgraph for given queries (e.g. a link query$Q= \lbrace u,v \rbrace$ $\to$ join sampled walks of node$u$ and$v$ as$\mathcal{G}_{Q} = \lbrace W_u \uplus W_v \rbrace$ )
Dec. 30, 2024
- Release v2.3 with bug fixes and improved memory efficiency
- Support pip install & MacOS
Feb. 25, 2023:
- Release v2.2 with more robust memory management of allocation, release and indexing (billion edges).
- Add bitwise-based hash for encoding structural features.
- Add test cases and script of wall time measure.
Jan. 29, 2023:
- Release v2.1 with refactored code base.
- More robust memory accessing with buffer for set sampler on large graphs (million nodes).
Jan. 28, 2023:
- Release v2.0 with the walk-based set sampler
gset_sampler
.
(Other versions may work, but are untested)
- python >= 3.8
- numpy >= 1.17
- gcc >= 8.4
- openmp (for MacOS, install llvm-openmp via Conda)
pip install .
subg.gset_sampler(indptr, indices, query, num_walks, num_steps)
-> (numpy.array [n, num_walks*(num_steps+1)], n * (numpy.array [?], numpy.array [?,num_steps+1]))
Sample a collection of paths for each node in query
(size of n
) through num_walks
-many num_steps
-step random walks on the input graph in CSR format (indptr
, indices
), and encodes landing probability at each step of all nodes in the sampled set as structural features of the seed node.
For usage examples, see test.py.
- indptr (np.array) - Index pointer array of the adjacency matrix in CSR format.
- indices (np.array) - Index array of the adjacency matrix in CSR format.
- query (np.array / list) - Nodes are queried to be sampled.
- num_walks (int) - The number of random walks.
- num_steps (int) - The number of steps in a walk.
- nthread (int, optional) - The number of threads.
- seed (int, optional) - Random seed.
-
walks (np.array) - Sampled walks
$W_q$ for each node inquery
. -
rpes (np.array, np.array) - Unique node set of sampled walks for each node in
query
and their corresponding structural encodings.
subg.gset_sampler(indptr, indices, query, num_walks, num_steps)
-> (numpy.array [n], numpy.array [2,?], numpy.array [?,num_steps+1])
Sample a node set for each node in query
(size of n
) through num_walks
-many num_steps
-step random walks on the input graph in CSR format (indptr
, indices
), and encodes landing probability at each step of all nodes in the sampled set as structural features of the seed node.
For usage examples, see test.py.
- indptr (np.array) - Index pointer array of the adjacency matrix in CSR format.
- indices (np.array) - Index array of the adjacency matrix in CSR format.
- query (np.array / list) - Nodes are queried to be sampled.
- num_walks (int) - The number of random walks.
- num_steps (int) - The number of steps in a walk.
- bucket (int, optional) - The buffer size for sampled neighbors per node.
- nthread (int, optional) - The number of threads.
- seed (int, optional) - Random seed.
- nsize (np.array) - The size of sampled set for each node in
query
. - remap (np.array) - Pairwised node id and the index of its associated structural encoding in
enc
array. - enc (np.array) - The compressed (unique) encoding of structural features.
subg.walk_join(walks, indices, query)
-> (numpy.array [2,n*num_walks*(num_steps+1)*2])
Join the sampled walks for nodes in each query
(size of n
). For a link query walk_join
returns the indices of structural features for
For usage examples, see test.py.
- indptr (np.array) - Index pointer array of the adjacency matrix in CSR format.
- indices (np.array) - Index array of the adjacency matrix in CSR format.
- query (np.array / list) - Nodes are queried to be sampled.
- nthread (int, optional) - The number of threads.
- join_walk (np.array) - The indices of structural features attached to the joint walks of given queries.