Ofnil currently supports only the widely used NeighborSampling method; other sampling methods are planned. Both DGL and PyG also implement NeighborSampling.
We first examine how the interfaces of the three projects differ, then compare their parameters, and finally provide code snippets that show how to scale up graph learning by switching from the NeighborSampling method in DGL or PyG to that of Ofnil.
The interface to neighbor sampling is a Dataloader, which turns data objects from the data source (stored data with their metadata in Ofnil, a Dataset in PyG/DGL) into mini-batches. The individual sampling procedures are implemented in a Sampler, which the Dataloader uses. The interface names compare as follows:
| Project | Sampler | Dataloader |
| --- | --- | --- |
| DGL | NeighborSampler (wrap it with dgl.dataloading.as_edge_prediction_sampler(sampler) for edge classification and link prediction) | dgl.dataloading.DataLoader (the sampler must be specified manually) |
| PyG | NeighborSampler | torch_geometric.loader.NeighborLoader (no need to specify the sampler; the loader is bound to NeighborSampler. For edge classification and link prediction, use torch_geometric.loader.LinkNeighborLoader, which is derived from torch_geometric.loader.LinkLoader.) |
| Ofnil | NeighborSampler | Ofnil.torch.NeighborSampledDataLoader (called in Ofnil.Client.get_neighbor_sampled_dataloader() or Ofnil.Client.neighbor_sampled_dataloader(), and bound to NeighborSampler. Similarly, Ofnil.torch.LinkNeighborSampledDataLoader is called in Ofnil.Client.get_link_neighbor_sampled_dataloader(). Ofnil uses HopCollector to support the output formats of both DGL and PyG; HopCollector is discussed later.) |
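The split between Sampler and Dataloader described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used by DGL, PyG, or Ofnil: `ToyNeighborSampler` and `ToyDataLoader` are hypothetical names, the graph is a plain adjacency dict, and sampling is done without replacement.

```python
import random
from typing import Dict, Iterator, List, Sequence, Set, Tuple

Adjacency = Dict[int, List[int]]
Edge = Tuple[int, int]  # (sampled neighbor, node it was sampled for)


class ToyNeighborSampler:
    """Hypothetical sampler: picks up to fanouts[i] neighbors per node at hop i."""

    def __init__(self, fanouts: Sequence[int], seed: int = 0) -> None:
        self.fanouts = list(fanouts)
        self.rng = random.Random(seed)

    def sample(self, adj: Adjacency, seeds: Sequence[int]) -> List[List[Edge]]:
        """Return one edge list per hop, expanding outward from the seed nodes."""
        hops: List[List[Edge]] = []
        frontier = list(seeds)
        for fanout in self.fanouts:
            edges: List[Edge] = []
            next_frontier: Set[int] = set()
            for dst in frontier:
                neighbors = adj.get(dst, [])
                picked = self.rng.sample(neighbors, min(fanout, len(neighbors)))
                for src in picked:
                    edges.append((src, dst))
                    next_frontier.add(src)
            hops.append(edges)
            # The nodes sampled at this hop become the frontier of the next hop.
            frontier = sorted(next_frontier)
        return hops


class ToyDataLoader:
    """Hypothetical loader: cuts the seeds into mini-batches and calls the sampler."""

    def __init__(self, adj: Adjacency, seeds: Sequence[int],
                 sampler: ToyNeighborSampler, batch_size: int) -> None:
        self.adj = adj
        self.seeds = list(seeds)
        self.sampler = sampler
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[Tuple[List[int], List[List[Edge]]]]:
        for start in range(0, len(self.seeds), self.batch_size):
            batch = self.seeds[start:start + self.batch_size]
            yield batch, self.sampler.sample(self.adj, batch)


adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
loader = ToyDataLoader(adj, seeds=[0, 1, 2, 3],
                       sampler=ToyNeighborSampler([2, 1]), batch_size=2)
batches = list(loader)
```

Each element of `batches` pairs a list of seed nodes with one edge list per hop. Swapping in a different sampler class would change the sampling procedure without touching the batching logic, which is the reason for keeping the two roles separate.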
Parameter differences between Ofnil NeighborSampledDataLoader and DGL NeighborSampler and DataLoader
| Ofnil NeighborSampledDataLoader | DGL NeighborSampler | DGL DataLoader | Description |
| --- | --- | --- | --- |
| data (Tuple[TopologyFeatureViewInfo, List[TableFeatureViewInfo]]) | | graph (DGL graph) | The input of Ofnil NeighborSampledDataLoader contains topology information and graph entity information, which can be used after processing. The input of DGL DataLoader is a DGL graph that can be used directly. |
| batch_size | | batch_size | The number of seed nodes in each batch. |
| replace | replace | | Whether to sample with replacement. |
| hop_collector | | | The hop collector used to produce the sampled subgraph in a specific output format. |
| need_edge | | | Whether to include edge features; False by default. |
| kwargs | | kwargs | For Ofnil, used to specify fanouts and indices. |
| | prefetch_labels | | |
| | output_device | | |
| | prefetch_node_feats | | |
| | prefetch_edge_feats | | |
| | fanouts | | Ofnil specifies fanouts via kwargs. |
| | edge_dir | | |
| | prob | | |
| | mask | | |
| | | indices | Ofnil specifies indices via kwargs. |
| | | graph_sampler | |
| | | device | |
| | | use_ddp | |
| | | ddp_seed | |
| | | drop_last | |
| | | shuffle | |
| | | use_prefetch_thread | |
| | | use_alternate_streams | |
| | | pin_prefetcher | |
| | | use_uva | |
| | | use_cpu_worker_affinity | |
| | | cpu_worker_affinity_cores | |
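The `replace` and `fanouts` parameters interact when a node has fewer neighbors than the requested fanout. The sketch below shows the usual meaning of the two modes in plain Python; `sample_neighbors` here is a hypothetical helper written for illustration, not a function from DGL, PyG, or Ofnil, and the exact behavior of each library should be checked against its own documentation.

```python
import random
from typing import List, Sequence


def sample_neighbors(neighbors: Sequence[int], fanout: int,
                     replace: bool, rng: random.Random) -> List[int]:
    """Pick neighbors for one node under the given fanout (illustrative only)."""
    if not neighbors:
        return []
    if replace:
        # With replacement: always `fanout` draws, duplicates are possible.
        return [rng.choice(neighbors) for _ in range(fanout)]
    # Without replacement: distinct picks, capped by the node's degree.
    return rng.sample(list(neighbors), min(fanout, len(neighbors)))


rng = random.Random(0)
neighbors = [10, 11, 12]          # a node with degree 3
without_rep = sample_neighbors(neighbors, fanout=5, replace=False, rng=rng)
with_rep = sample_neighbors(neighbors, fanout=5, replace=True, rng=rng)
```

Here `without_rep` contains each of the three neighbors exactly once, while `with_rep` always has five entries drawn from the same three neighbors. The second mode gives fixed-size neighborhoods at the cost of repeated entries.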
Parameter differences between Ofnil NeighborSampledDataLoader and PyG NeighborLoader
| Ofnil NeighborSampledDataLoader | PyG NeighborLoader | Description |
| --- | --- | --- |
| data (Tuple[TopologyFeatureViewInfo, List[TableFeatureViewInfo]]) | data (Union[Data, HeteroData, Tuple[FeatureStore, GraphStore]]) | Ofnil and PyG use different input data structures. The input of Ofnil NeighborSampledDataLoader contains topology information and graph entity information, which can be used after processing. The data input in PyG can be used directly. |
| batch_size | | The number of seed nodes in each batch. In PyG, batch_size can be passed to torch.utils.data.DataLoader through kwargs. In Ofnil, batch_size is required. |
| need_edge | | Whether to include edge features; False by default. |
| replace | replace | Whether to sample with replacement. |
| hop_collector | | The hop collector used to produce the sampled subgraph in a specific output format. |
| kwargs | kwargs | Key-value pairs passed to the parent PyTorch class torch.utils.data.DataLoader, for example batch_size. |
| | num_neighbors | Ofnil accepts the same parameter through kwargs. |
| | input_nodes | Ofnil accepts the same parameter through kwargs. |
| | input_time | |
| | directed | |
| | disjoint | |
| | temporal_strategy | |
| | time_attr | |
| | transform | |
| | transform_sampler_output | |
| | is_sorted | |
| | filter_per_worker | |
| | neighbor_sampler | Optional in PyG. NeighborLoader constructs a neighbor sampler itself if the user does not supply one. |
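Several rows above rely on arguments travelling through kwargs to a parent loader class. The following sketch shows that forwarding pattern with stand-in classes: `BaseLoader` only plays the role of a generic parent such as torch.utils.data.DataLoader, and `ToyNeighborLoader` is a hypothetical subclass, so neither reflects the real signatures of PyG or Ofnil.

```python
from typing import Any, Iterator, List, Sequence


class BaseLoader:
    """Stand-in parent loader that owns the generic batching options."""

    def __init__(self, items: Sequence[Any], batch_size: int = 1,
                 drop_last: bool = False) -> None:
        self.items = list(items)
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self) -> Iterator[List[Any]]:
        for start in range(0, len(self.items), self.batch_size):
            batch = self.items[start:start + self.batch_size]
            if self.drop_last and len(batch) < self.batch_size:
                break  # discard the final, incomplete batch
            yield batch


class ToyNeighborLoader(BaseLoader):
    """Hypothetical subclass: keeps sampling options, forwards the rest."""

    def __init__(self, input_nodes: Sequence[int],
                 num_neighbors: Sequence[int], **kwargs: Any) -> None:
        self.num_neighbors = list(num_neighbors)
        # batch_size, drop_last, ... are not named here; they reach the
        # parent class untouched through **kwargs.
        super().__init__(input_nodes, **kwargs)


loader = ToyNeighborLoader([0, 1, 2, 3, 4], num_neighbors=[10, 5],
                           batch_size=2, drop_last=True)
batches = list(loader)
```

With five input nodes, a batch size of 2, and `drop_last=True`, iteration yields two full batches and drops the leftover node. A loader that instead declares `batch_size` as a required argument of its own, as the table describes for Ofnil, makes the option explicit rather than implicit.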
Ofnil node sampler
import ofnil

# OFNIL_HOME, the view IDs, batch_size, and the hop collector classes are
# assumed to be defined or imported beforehand.
client = ofnil.Client(OFNIL_HOME)
loader = client.get_neighbor_sampled_dataloader(
    topo_view_id,
    [table_view_id],
    sample_with_replacement=True,
    num_neighbors=[10, 5],
    batch_size=batch_size,
    # or PygHopCollector(disjoint=disjoint, sparse=sparse)
    hop_collector=DGLHopCollector(sparse=True),
    input_nodes="Product",
)
sampled_hetero_data = next(iter(loader))
Ofnil edge sampler
import ofnil
import torch

# OFNIL_HOME, the view IDs, batch_size, seed_edge_types, num_edges, and the
# hop collector class are assumed to be defined or imported beforehand.
client = ofnil.Client(OFNIL_HOME)
loader = client.get_link_neighbor_sampled_dataloader(
    topo_view_id,
    [table_view_id],
    sample_with_replacement=True,
    num_neighbors=[10, 5],
    batch_size=batch_size,
    hop_collector=PyGHopCollector(disjoint=disjoint, sparse=sparse),
    edge_label_index=seed_edge_types,
    edge_label=torch.ones(num_edges),
)
sampled_hetero_data = next(iter(loader))
DGL node sampler
import dgl

# g is a DGL graph and train_nid holds the IDs of the training nodes.
sampler = dgl.dataloading.NeighborSampler([5, 10, 15])
dataloader = dgl.dataloading.DataLoader(
    g, train_nid, sampler,
    batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
for input_nodes, output_nodes, blocks in dataloader:
    train_on(blocks)
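DGL edge sampler
import dgl

# g is a DGL graph and train_eid holds the IDs of the training edges.
sampler = dgl.dataloading.NeighborSampler([5, 10, 15])
# Wrap the node-wise sampler so that it samples around seed edges.
sampler = dgl.dataloading.as_edge_prediction_sampler(sampler)
dataloader = dgl.dataloading.DataLoader(
    g, train_eid, sampler,
    batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
# Without a negative sampler, the second item is the graph of seed edges.
for input_nodes, pair_graph, blocks in dataloader:
    train_on(blocks)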
PyG node sampler
from torch_geometric.loader import NeighborLoader

# data is a PyG Data object with a boolean train_mask attribute.
loader = NeighborLoader(
    data,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training nodes
    batch_size=128,
    input_nodes=data.train_mask,
)
sampled_data = next(iter(loader))
PyG edge sampler
import torch
from torch_geometric.loader import LinkNeighborLoader

# data is a PyG Data object; every edge in it is used as a seed edge.
loader = LinkNeighborLoader(
    data,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training edges
    batch_size=128,
    edge_label_index=data.edge_index,
    # Provide edge labels for the sampled edges
    edge_label=torch.ones(data.edge_index.size(1)),
)
sampled_data = next(iter(loader))