SubG: Subgraph Operation Accelerator

The SubG package is an extension library based on C and OpenMP to accelerate subgraph operations for building structural features and subgraph-based graph representation learning (SGRL).

Follow the principles of algorithm system co-design, subgraph queries (e.g. ego-network in canonical SGRLs) of target links, motifs, and high-order patterns can be decomposed into node-level intermediate results (e.g. collection of walks by walk_sampler in SUREL, set of nodes by gset_sampler in SUREL+), whose joint can act as proxies of subgraphs, and be reused among different queries.

Currently, SubG consists of the following methods for efficient and scalable implementation of SGRLs:

gset_sampler node set sampling with structure encoder of landing probability (LP)
walk_sampler walk sampling with relative positional encoder (RPE)
batch_sampler query sampling (a group of nodes) for mini-batch training of link prediction
walk_join online joining of node-level walks to construct the proxy of subgraph for given queries (e.g. a link query $Q= \lbrace u,v \rbrace$ $\to$ join sampled walks of node $u$ and $v$ as $\mathcal{G}_{Q} = \lbrace W_u \uplus W_v \rbrace$)

Update

Dec. 30, 2024

Release v2.3 with bug fixes and improved memory efficiency
Support pip install & MacOS

Feb. 25, 2023:

Release v2.2 with more robust memory management of allocation, release and indexing (billion edges).
Add bitwise-based hash for encoding structural features.
Add test cases and script of wall time measure.

Jan. 29, 2023:

Release v2.1 with refactored code base.
More robust memory accessing with buffer for set sampler on large graphs (million nodes).

Jan. 28, 2023:

Release v2.0 with the walk-based set sampler gset_sampler.

Requirements

(Other versions may work, but are untested)

python >= 3.8
numpy >= 1.17
gcc >= 8.4
openmp (for MacOS, install llvm-openmp via Conda)

Installation

pip install .

Functions

walk_sampler

subg.gset_sampler(indptr, indices, query, num_walks, num_steps) 
-> (numpy.array [n, num_walks*(num_steps+1)], n * (numpy.array [?], numpy.array [?,num_steps+1]))

Sample a collection of paths for each node in query (size of n) through num_walks-many num_steps-step random walks on the input graph in CSR format (indptr, indices), and encodes landing probability at each step of all nodes in the sampled set as structural features of the seed node.

For usage examples, see test.py.

Parameters

indptr (np.array) - Index pointer array of the adjacency matrix in CSR format.
indices (np.array) - Index array of the adjacency matrix in CSR format.
query (np.array / list) - Nodes are queried to be sampled.
num_walks (int) - The number of random walks.
num_steps (int) - The number of steps in a walk.
nthread (int, optional) - The number of threads.
seed (int, optional) - Random seed.

Returns

walks (np.array) - Sampled walks $W_q$ for each node in query.
rpes (np.array, np.array) - Unique node set of sampled walks for each node in query and their corresponding structural encodings.

gset_sampler

subg.gset_sampler(indptr, indices, query, num_walks, num_steps) 
-> (numpy.array [n], numpy.array [2,?], numpy.array [?,num_steps+1])

Sample a node set for each node in query (size of n) through num_walks-many num_steps-step random walks on the input graph in CSR format (indptr, indices), and encodes landing probability at each step of all nodes in the sampled set as structural features of the seed node.

For usage examples, see test.py.

Parameters

indptr (np.array) - Index pointer array of the adjacency matrix in CSR format.
indices (np.array) - Index array of the adjacency matrix in CSR format.
query (np.array / list) - Nodes are queried to be sampled.
num_walks (int) - The number of random walks.
num_steps (int) - The number of steps in a walk.
bucket (int, optional) - The buffer size for sampled neighbors per node.
nthread (int, optional) - The number of threads.
seed (int, optional) - Random seed.

Returns

nsize (np.array) - The size of sampled set for each node in query.
remap (np.array) - Pairwised node id and the index of its associated structural encoding in enc array.
enc (np.array) - The compressed (unique) encoding of structural features.

walk_join

subg.walk_join(walks, indices, query) 
-> (numpy.array [2,n*num_walks*(num_steps+1)*2])

Join the sampled walks for nodes in each query (size of n). For a link query $Q= \lbrace u,v \rbrace$, walk_join returns the indices of structural features for $u$ as $W_{u|u} \bigoplus W_{u|v}$ and for $v$ as $W_{v|u} \bigoplus W_{v|v}$.

For usage examples, see test.py.

Parameters

indptr (np.array) - Index pointer array of the adjacency matrix in CSR format.
indices (np.array) - Index array of the adjacency matrix in CSR format.
query (np.array / list) - Nodes are queried to be sampled.
nthread (int, optional) - The number of threads.

Returns

join_walk (np.array) - The indices of structural features attached to the joint walks of given queries.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
include		include
lib		lib
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SubG: Subgraph Operation Accelerator

Update

Requirements

Installation

Functions

walk_sampler

Parameters

Returns

gset_sampler

Parameters

Returns

walk_join

Parameters

Returns

About

Releases

Packages

Languages

License

VeritasYin/subg_acc

Folders and files

Latest commit

History

Repository files navigation

SubG: Subgraph Operation Accelerator

Update

Requirements

Installation

Functions

walk_sampler

Parameters

Returns

gset_sampler

Parameters

Returns

walk_join

Parameters

Returns

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages