The current application can be deployed on a cluster of arbitrary size using multiple CPUs. For a GPU-based solution, refer to [GraphLearn-for-PyTorch (GLT)](https://github.com/pyg-team/pytorch_geometric/tree/5c0c2924a6c041db07d397547eff7fdf833a4ff8/examples/distributed/graphlearn_for_pytorch).

The solution is designed to effortlessly distribute the training of large-scale graph neural networks across multiple nodes, thanks to the integration of Distributed Data Parallelism (DDP) for model training and Remote Procedure Call (RPC) for efficient sampling and fetching of non-local features.
The design includes a number of custom classes, e.g. a `DistNeighborSampler` that implements CPU sampling algorithms on local and remote data while keeping a consistent data structure at the output; an integrated `DistLoader` that ensures safe opening and closing of the RPC connections between the samplers; a METIS-based `Partitioner`; and many more.
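
To make the division of labor concrete, below is a minimal runnable sketch of the DDP-plus-RPC pattern using plain `torch.distributed` primitives rather than the actual `torch_geometric.distributed` internals; `fetch_remote_feats` and the `FEATS` shard are hypothetical stand-ins for the sampler and feature-store machinery, and the two "nodes" are simulated by two local processes:

```python
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

NUM_NODES = 2
FEATS = torch.randn(100, 16)  # hypothetical per-rank feature shard

def fetch_remote_feats(node_ids):
    # Executed on the rank that owns the shard, invoked via RPC.
    return FEATS[node_ids]

def run(rank, world_size):
    # DDP and RPC rendezvous on separate ports (cf. the open TCP ports
    # mentioned in the requirements later in this README).
    dist.init_process_group('gloo', init_method='tcp://127.0.0.1:11111',
                            rank=rank, world_size=world_size)
    rpc.init_rpc(f'worker{rank}', rank=rank, world_size=world_size,
                 rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                     init_method='tcp://127.0.0.1:11112'))

    model = DDP(torch.nn.Linear(16, 2))
    # Fetch features owned by the other rank, then train on them locally.
    owner = (rank + 1) % world_size
    feats = rpc.rpc_sync(f'worker{owner}', fetch_remote_feats,
                         args=(torch.arange(4),))
    model(feats).sum().backward()  # DDP all-reduces gradients here

    rpc.shutdown()
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(run, args=(NUM_NODES,), nprocs=NUM_NODES)
```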

## Distributed GraphSAGE for `ogbn-products` (Homogeneous, Multi-Node) Example

To run the example, please refer to the steps below.
- Latest PyG and pyg-lib.
- Password-less SSH needs to be set up on all the nodes that you are using ([Linux SSH manual](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/)).
- All nodes need to have a consistent environment installed; specifically, the torch and pyg-lib versions must be the same. You might want to consider using Docker containers.
- [Optional] In some cases, the Linux firewall might block TCP connections between the nodes. Ensure that the firewall settings allow all nodes to communicate ([Linux firewall manual](https://ubuntu.com/server/docs/security-firewall)). For this example, TCP ports 11111, 11112, and 11113 should be open (e.g. `sudo ufw allow 11111`); a quick reachability check is sketched below.
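
As a quick sanity check, the hypothetical helper below probes the example ports from another node; note that an unreachable result can mean either a firewall block or simply that nothing is listening on that port yet:

```python
# Probe node#0's example ports from any other cluster member.
import socket

MASTER_ADDR = '192.168.0.10'  # replace with node#0's IP
for port in (11111, 11112, 11113):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2.0)
        ok = s.connect_ex((MASTER_ADDR, port)) == 0
        print(f'{MASTER_ADDR}:{port} is', 'reachable' if ok else 'unreachable')
```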


This example shows how to use distributed PyG to train PyG models in a distributed scenario with multiple CPU nodes.

### Step 1: Prepare and partition the data

Here, we use `ogbn-products` and partition it into two partitions (by default) with METIS:
```bash
python partition_graph.py --dataset=ogbn-products --root_dir=./data/products --num_partitions=2
```
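
For reference, here is a rough sketch of what `partition_graph.py` does under the hood, assuming the `torch_geometric.distributed.Partitioner` API; treat the exact argument names as illustrative and defer to the script itself:

```python
# Partition ogbn-products into two parts with the METIS-based Partitioner.
from ogb.nodeproppred import PygNodePropPredDataset
from torch_geometric.distributed import Partitioner

dataset = PygNodePropPredDataset('ogbn-products', root='./data/products')
partitioner = Partitioner(
    data=dataset[0],
    num_parts=2,
    root='./data/partitions/ogbn-products/2-parts',
)
partitioner.generate_partition()  # writes one sub-folder per partition
```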

The generated partition will have the folder structure shown below.

<img width="350" alt="partition-graph" src="https://github.com/pyg-team/pytorch_geometric/assets/58218729/2169e362-0259-4ac4-ab5e-8500b6b5bf4a">

**Caution:** Partitioning with METIS is non-deterministic!
All nodes should be able to access the same partition data. Therefore, either generate the partition on one node and copy the data to all members of the cluster, or place the folder in a shared location.
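
For example, assuming the password-less SSH set up in the requirements and a hypothetical host name, copying the generated folder to the other cluster members could look like this:

```python
# Copy the partition folder to every non-master node over SSH.
import subprocess

PART_DIR = './data/partitions/ogbn-products/2-parts'
for host in ('node1.example.com',):  # hypothetical non-master nodes
    subprocess.run(['scp', '-r', PART_DIR, f'{host}:{PART_DIR}'], check=True)
```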

### Step 2: Run the example on each training node

To run the example, you can execute the commands on each node manually or use the provided launch script.

#### Option A: Manual execution
You should change the `master_addr` to the IP of `node#0`. Make sure that the correct `node_rank` is provided, with the master node assigned rank 0. The `dataset_root_dir` should point to the head directory where your partition is placed, e.g. `../../data/partitions/ogbn-products/2-parts`.

```bash
# Node 0 (master, rank 0). The flag names below follow the parameters
# described above but are assumptions; verify them against
# `python distributed_cpu.py --help`:
python distributed_cpu.py \
  --dataset_root_dir=../../data/partitions/ogbn-products/2-parts \
  --num_nodes=2 \
  --node_rank=0 \
  --master_addr=<node#0 IP>

# Node 1: run the same command with `--node_rank=1`.
```

#### Option B: Launch script
TBD.
See PR [#8241](https://github.com/pyg-team/pytorch_geometric/pull/8241).
