
Output of distributed Galois is confusing #389

Open
YuxinxinChen opened this issue Feb 14, 2022 · 18 comments

@YuxinxinChen

Hi Galois Team,

I am trying to run pagerank-push-dist with 2 nodes: mpirun -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g

The output is confusing. It looks to me like only process 0 is running.
Besides, for the input, if I would like to use the ginger-o partition, how can I get the transposed .tgr file?

Thanks!

@l-hoang
Member

l-hoang commented Feb 14, 2022

Please post your output here.

ginger-o can be specified using the --partition option (use -h to see the correct argument). For the transposed
graph, use the graph-convert tool under the tools directory at the root level.

You can use mpirun --tag-output to confirm whether multiple processes are running, though I see nothing wrong
with the way you are running it there (that command indicates 2 machines each with 1 GPU; if you want
1 machine with 2 GPUs on it, do mpirun -n 1 --pset=gg)
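
As a rough sketch of what both suggestions could look like on the command line (the paths are placeholders, and the exact graph-convert conversion flag is an assumption on my part; check graph-convert --help for the option that writes a transposed .tgr):

# assumed conversion flag; verify with: graph-convert --help
$ROOT/build/tools/graph-convert/graph-convert -gr2tgr mygraph.gr mygraph.tgr
# same run as before, with each output line tagged by the producing rank
mpirun --tag-output -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g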

@nicelhc13
Contributor

I suspect that you may not be running your applications on 2 nodes. Please check that your cluster scheduler (e.g. SLURM) is being used correctly. If you want to run this on 1 node with 2 hosts, you should specify --pset=gg as Loc pointed out.

@YuxinxinChen
Author

Here is my output

D-Galois Benchmark Suite v6.0.0 (unknown)
Copyright (C) 2018 The University of Texas at Austin
http://iss.ices.utexas.edu/galois/

application: PageRank - Compiler Generated Distributed Heterogeneous
Residual PageRank on Distributed Galois.

[0] Master distribution time : 0.046137 seconds to read 216 bytes in 27 seeks (0.00468171 MBPS)
[0] Starting graph reading.
[0] Reading graph complete.
[0] Edge inspection time: 28.4214 seconds to read 4335433632 bytes (152.541 MBPS)
Loading edge-data while creating edges
[0] Edge loading time: 107.842 seconds to read 4335433632 bytes (40.2018 MBPS)
[0] Graph construction complete.
[0] Using GPU 0: Tesla V100-SXM2-16GB
[0] Host memory for communication context: 1338 MB
[0] Host memory for graph: 7850 MB
[0] InitializeGraph::go called
[0] PageRank::go run 0 called
Max rank is 76137.8
Min rank is 0.15
Rank sum is 3.4924e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 1 called
Max rank is 76139.8
Min rank is 0.15
Rank sum is 3.49271e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 2 called
Max rank is 76110.6
Min rank is 0.15
Rank sum is 3.49043e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 3 called
Max rank is 76122
Min rank is 0.15
Rank sum is 3.49152e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 4 called
Max rank is 76127.7
Min rank is 0.15
Rank sum is 3.49148e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 5 called
Max rank is 76109.1
Min rank is 0.15
Rank sum is 3.49026e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 6 called
Max rank is 77099.7
Min rank is 0.15
Rank sum is 3.53274e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 7 called
Max rank is 76722.2
Min rank is 0.15
Rank sum is 3.51755e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 8 called
Max rank is 76370
Min rank is 0.15
Rank sum is 3.50499e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[0] PageRank::go run 9 called
Max rank is 76111.7
Min rank is 0.15
Rank sum is 3.49062e+07
Residual sum is 0
# nodes with residual over 0.01 (tolerance) is 0
Max residual is 1.17549e-38
Min residual is 0
[1] Master distribution time : 0.064927 seconds to read 416 bytes in 52 seeks (0.0064072 MBPS)
[1] Starting graph reading.
[1] Reading graph complete.
[1] Edge inspection time: 29.8542 seconds to read 4335991372 bytes (145.239 MBPS)
[1] Edge loading time: 104.426 seconds to read 4335991372 bytes (41.522 MBPS)
[1] Graph construction complete.
[1] Using GPU 0: Tesla V100-SXM2-16GB
[1] Host memory for communication context: 1338 MB
[1] Host memory for graph: 7850 MB
[1] InitializeGraph::go called
[1] PageRank::go run 0 called
[1] PageRank::go run 1 called
[1] PageRank::go run 2 called
[1] PageRank::go run 3 called
[1] PageRank::go run 4 called
[1] PageRank::go run 5 called
[1] PageRank::go run 6 called
[1] PageRank::go run 7 called
[1] PageRank::go run 8 called
[1] PageRank::go run 9 called

PE 1 seems to call PageRank::go without printing any detailed information.
More specifically, if I would like to know the communication volume between the two processes, do you have any handy tools or stats available for that?

Thanks!

@nicelhc13
Contributor

Thank you. This looks slightly weird to me. Let me try to reproduce your problem.
Do you still use the same command?
Could you please provide the output of the nvidia-smi command?

Regarding communication volumes of Gluon, you could enable this flag through CMake: GALOIS_COMM_STATS.
It provides reduce/broadcast communication volumes.
(You can find the detailed information in libgluon/include/galois/graphs/GluonSubstrate.h)
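
If it helps, a minimal sketch of turning that on at configure time (assuming a standard out-of-source CMake build; the source path and build target below are placeholders):

# reconfigure with communication stats enabled, then rebuild the app
cmake -DGALOIS_COMM_STATS=ON /path/to/Galois
make pagerank-push-dist

After rebuilding, the stats output should include the per-kernel reduce/broadcast byte counts.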

@YuxinxinChen
Author

Yes, same command. Also, I am using Oak Ridge Summit, but it should work like Slurm. I am not sure if that causes the problem, but Summit's MPI has been working well for my other applications.

Here is the result of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   29C    P0    37W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Mon Feb 14 15:13:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   29C    P0    37W / 300W |      0MiB / 16160MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

------------------------------------------------------------

@l-hoang
Member

l-hoang commented Feb 14, 2022

your output looks sane to me, actually: both hosts ([0] and [1]) are picking up the GPU local to that host and
things are running; the sanity output is only printed on host 0, and the prints are out of order since
you can't guarantee when things get flushed to disk

do you have the stats files? (you can save them to disk with -statFile=, otherwise they are output to stdout, and
Hochan mentioned the flag you can use to get more communication info)
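
For instance, something like this (a sketch based on the command you already ran; the stats file path is a placeholder):

mpirun -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g --statFile=/path/to/pr-stats.output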

@YuxinxinChen
Author

Here is the stats file, but there is no stats information for PE 1:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 107841
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 29854
STAT, 0, dGraph_Generic, GraphReading, HMAX, 2329
STAT, 0, DistBench, GraphConstructTime, HMAX, 146516
STAT, 0, DistBench, TIMER_GRAPH_MARSHAL, HMAX, 8280
STAT, 0, PageRank, ResetGraph_0, HMAX, 1
STAT, 0, PageRank, InitializeGraph_0, HMAX, 1
STAT, 0, PageRank, PageRank_0, HMAX, 10574
STAT, 0, PageRank, NumWorkItems_0, HSUM, 201823910
STAT, 0, PageRank, NumIterations_0, HOST_0, 1046
STAT, 0, PageRank, ResetGraph_1, HMAX, 1
STAT, 0, PageRank, InitializeGraph_1, HMAX, 1
STAT, 0, PageRank, Timer_0, HMAX, 14083
STAT, 0, PageRank, PageRank_1, HMAX, 10763
STAT, 0, PageRank, NumWorkItems_1, HSUM, 201764411
STAT, 0, PageRank, NumIterations_1, HOST_0, 1115
STAT, 0, PageRank, InitializeGraph_2, HMAX, 1
STAT, 0, PageRank, Timer_1, HMAX, 14097
STAT, 0, PageRank, PageRank_2, HMAX, 10932
STAT, 0, PageRank, NumWorkItems_2, HSUM, 202000910
STAT, 0, PageRank, NumIterations_2, HOST_0, 1024
STAT, 0, PageRank, InitializeGraph_3, HMAX, 1
STAT, 0, PageRank, Timer_2, HMAX, 13794
STAT, 0, PageRank, PageRank_3, HMAX, 10792
STAT, 0, PageRank, NumWorkItems_3, HSUM, 201907168
STAT, 0, PageRank, NumIterations_3, HOST_0, 966
STAT, 0, PageRank, ResetGraph_4, HMAX, 21
STAT, 0, PageRank, InitializeGraph_4, HMAX, 1
STAT, 0, PageRank, Timer_3, HMAX, 13477
STAT, 0, PageRank, PageRank_4, HMAX, 10980
STAT, 0, PageRank, NumWorkItems_4, HSUM, 201849735
STAT, 0, PageRank, NumIterations_4, HOST_0, 946
STAT, 0, PageRank, InitializeGraph_5, HMAX, 1
STAT, 0, PageRank, Timer_4, HMAX, 13791
STAT, 0, PageRank, PageRank_5, HMAX, 11154
STAT, 0, PageRank, NumWorkItems_5, HSUM, 201998277
STAT, 0, PageRank, NumIterations_5, HOST_0, 1142
STAT, 0, PageRank, InitializeGraph_6, HMAX, 1
STAT, 0, PageRank, Timer_5, HMAX, 14269
STAT, 0, PageRank, PageRank_6, HMAX, 10076
STAT, 0, PageRank, NumWorkItems_6, HSUM, 193896959
STAT, 0, PageRank, NumIterations_6, HOST_0, 1063
STAT, 0, PageRank, ResetGraph_7, HMAX, 20
STAT, 0, PageRank, InitializeGraph_7, HMAX, 1
STAT, 0, PageRank, Timer_6, HMAX, 13303
STAT, 0, PageRank, PageRank_7, HMAX, 10584
STAT, 0, PageRank, NumWorkItems_7, HSUM, 198799184
STAT, 0, PageRank, NumIterations_7, HOST_0, 932
STAT, 0, PageRank, ResetGraph_8, HMAX, 1
STAT, 0, PageRank, InitializeGraph_8, HMAX, 1
STAT, 0, PageRank, Timer_7, HMAX, 13503
STAT, 0, PageRank, PageRank_8, HMAX, 10444
STAT, 0, PageRank, NumWorkItems_8, HSUM, 200358039
STAT, 0, PageRank, NumIterations_8, HOST_0, 964
STAT, 0, PageRank, InitializeGraph_9, HMAX, 1
STAT, 0, PageRank, Timer_8, HMAX, 13283
STAT, 0, PageRank, PageRank_9, HMAX, 10713
STAT, 0, PageRank, NumWorkItems_9, HSUM, 202030884
STAT, 0, PageRank, NumIterations_9, HOST_0, 1022
STAT, 0, PageRank, Timer_9, HMAX, 13979
STAT, 0, PageRank, TimerTotal, HMAX, 293282
STAT, 0, PageRank, ResetGraph_2, HMAX, 1
STAT, 0, PageRank, ResetGraph_3, HMAX, 1
STAT, 0, PageRank, ResetGraph_5, HMAX, 1
STAT, 0, PageRank, ResetGraph_9, HMAX, 1
STAT, 0, Gluon, ReduceSendBytes_PageRank_0, HSUM, 3713563396
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 348
STAT, 0, Gluon, Sync_PageRank_0, HMAX, 1618
STAT, 0, Gluon, ReduceSendBytes_PageRank_1, HSUM, 3716312728
STAT, 0, Gluon, ReduceNumMessages_PageRank_1, HSUM, 290
STAT, 0, Gluon, Sync_PageRank_1, HMAX, 1353
STAT, 0, Gluon, ReduceSendBytes_PageRank_2, HSUM, 3744828140
STAT, 0, Gluon, ReduceNumMessages_PageRank_2, HSUM, 288
STAT, 0, Gluon, Sync_PageRank_2, HMAX, 1266
STAT, 0, Gluon, ReduceSendBytes_PageRank_3, HSUM, 3713466332
STAT, 0, Gluon, ReduceNumMessages_PageRank_3, HSUM, 365
STAT, 0, Gluon, Sync_PageRank_3, HMAX, 1157
STAT, 0, Gluon, ReduceSendBytes_PageRank_4, HSUM, 3741462848
STAT, 0, Gluon, ReduceNumMessages_PageRank_4, HSUM, 364
STAT, 0, Gluon, Sync_PageRank_4, HMAX, 1143
STAT, 0, Gluon, ReduceSendBytes_PageRank_5, HSUM, 3751941512
STAT, 0, Gluon, ReduceNumMessages_PageRank_5, HSUM, 388
STAT, 0, Gluon, Sync_PageRank_5, HMAX, 1504
STAT, 0, Gluon, ReduceSendBytes_PageRank_6, HSUM, 3489017220
STAT, 0, Gluon, ReduceNumMessages_PageRank_6, HSUM, 395
STAT, 0, Gluon, Sync_PageRank_6, HMAX, 1393
STAT, 0, Gluon, ReduceSendBytes_PageRank_7, HSUM, 3628752704
STAT, 0, Gluon, ReduceNumMessages_PageRank_7, HSUM, 319
STAT, 0, Gluon, Sync_PageRank_7, HMAX, 1219
STAT, 0, Gluon, ReduceSendBytes_PageRank_8, HSUM, 3600264744
STAT, 0, Gluon, ReduceNumMessages_PageRank_8, HSUM, 359
STAT, 0, Gluon, Sync_PageRank_8, HMAX, 1110
STAT, 0, Gluon, ReduceSendBytes_PageRank_9, HSUM, 3723709080
STAT, 0, Gluon, ReduceNumMessages_PageRank_9, HSUM, 333
STAT, 0, Gluon, Sync_PageRank_9, HMAX, 1356
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1.85977
PARAM, 0, DistBench, CommandLine, HOST_0, /ccs/home/yuxinc/Galois/build/gcc-11.1-nvcc-11.4/lonestar/analytics/distributed/pagerank/pagerank-push-dist /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/twitter/twitter/twitter-ICWSM10-component.gr --num_nodes=2 --partition=oec --pset=g --runs=10 --exec=Async --tolerance=0.01 --statFile=/gpfs/alpine/bif115/scratch/yuxinc/Galois/pagearank-push-dist/pr-o%j.output
PARAM, 0, DistBench, Threads, HOST_0, 1
PARAM, 0, DistBench, Hosts, HOST_0, 2
PARAM, 0, DistBench, Runs, HOST_0, 10
PARAM, 0, DistBench, Run_UUID, HOST_0, dd472ca4-8a16-4966-9930-573dc7646475
PARAM, 0, DistBench, Input, HOST_0, /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/twitter/twitter/twitter-ICWSM10-component.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, h30n09
PARAM, 0, PageRank, Max Iterations, HOST_0, 1000
PARAM, 0, PageRank, Tolerance, HOST_0, 0.01
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

In particular, I am interested in the load balance across the different processes and the communication volume between processes. From the stats results, I got no information for PE 1. Besides libgluon/include/galois/graphs/GluonSubstrate.h, are there any tools or stats available for the local workload?

@l-hoang
Member

l-hoang commented Feb 14, 2022

If you want per host timers in the stats file, set GALOIS_PRINT_PER_HOST_STATS=1 when you run the program.

Local workload is captured by the InitializeGraph_, PageRank_, etc. timers (one for each run); you can get the timer names by
looking at the PageRank source and at the timers surrounding each compute phase.

@YuxinxinChen
Author

I didn't find GALOIS_PRINT_PER_HOST_STATS as a macro, CMake flag, environment variable, or variable anywhere in your master branch code. Could you explain more about setting GALOIS_PRINT_PER_HOST_STATS=1 when I run the program?

Thanks in advance!

@nicelhc13
Contributor

Now this flag is PRINT_PER_HOST_STATS=1.
Could you please use this flag?
Below are my command and part of the result:

PRINT_PER_HOST_STATS=1 mpirun -np 2 ./pagerank-push-dist test.tgr --num_nodes=2 --partition=oec --pset=g

STAT, 0, PageRank, NumWorkItems_0, HostValues, 9735; 24846
STAT, 0, PageRank, PageRank_0, HMAX, 16
STAT, 0, PageRank, PageRank_0, HostValues, 16; 12
STAT, 0, PageRank, NumIterations_0, HOST_0, 185
STAT, 0, PageRank, NumIterations_0, HostValues, 185
STAT, 0, PageRank, Timer_0, HMAX, 28             
STAT, 0, PageRank, Timer_0, HostValues, 27; 28

For example, in the line STAT, 0, PageRank, Timer_0, HostValues, 27; 28, the first value (27) is the runtime in milliseconds of host 0, and the next value (28) is the runtime in milliseconds of host 1.

@l-hoang
Member

l-hoang commented Feb 18, 2022

slight correction: the order of appearance of the HostValues does not correspond to the host; e.g. the first 27 isn't necessarily host 0

unfortunately, the stats currently have no way to distinguish which timer belongs to which host

@YuxinxinChen
Author

YuxinxinChen commented Feb 18, 2022

Thanks a lot for your help! I am able to print out the information, including workload and time per host. This is convenient, great work!
I ran PageRank with a single GPU and with 2 GPUs on 2 nodes. I would expect close to half the runtime on 2 GPUs compared to the single-GPU runtime, but I got a similar runtime with 1 GPU and 2 GPUs. Is this normal, or did I do something wrong? (I tried all partition methods and the Async/Sync options, on several social network graphs such as twitter, soc-LiveJournal1, and hollywood.)

@nicelhc13
Contributor

nicelhc13 commented Feb 18, 2022

It is hard to answer with only this information, but generally it should be scalable. Please check the Gluon paper; it includes GPU scalability results. Please check and understand the time breakdowns in the stat file.

It is possible that the communication overhead outweighs the benefit of distributing the computation.

@roshandathathri
Member

Are you using GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none?

@YuxinxinChen
Author

Are you using GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none?

No, the command I use is: mpirun -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g --exec=Async. I tried Async and Sync and all partition options.

I am converting the twitter40 graph from this website: https://snap.stanford.edu/data/twitter-2010.html. I will try this dataset and see if it gets better strong-scaling performance. If not, I might be doing something wrong, and I hope I can get help from you guys.

Thanks!

@nicelhc13
Contributor

Could you please run with the GALOIS_DO_NOT_BIND_THREADS=1 environment variable, as Roshan suggested?
It can affect the performance of distributed apps.
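
For example, combining it with your earlier command would look roughly like this (a sketch; adjust the paths and options to your setup):

GALOIS_DO_NOT_BIND_THREADS=1 mpirun --bind-to none -n 2 $ROOT/lonestar/analytics/distributed/pagerank/pagerank-push-dist mygraph.gr --num_nodes=2 --partition=oec --pset=g --exec=Async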

@YuxinxinChen
Author

YuxinxinChen commented Feb 19, 2022

I tried the flag GALOIS_DO_NOT_BIND_THREADS=1 with 2 GPUs on 2 nodes:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 563
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 688
STAT, 0, dGraph_Generic, GraphReading, HMAX, 213
STAT, 0, DistBench, GraphConstructTime, HMAX, 2205
STAT, 0, DistBench, TIMER_GRAPH_MARSHAL, HMAX, 1083
STAT, 0, PageRank, ResetGraph_0, HMAX, 1
STAT, 0, PageRank, PageRank_0, HMAX, 995
STAT, 0, PageRank, NumWorkItems_0, HSUM, 28967334
STAT, 0, PageRank, NumIterations_0, HOST_0, 56
STAT, 0, PageRank, Timer_0, HMAX, 1013
STAT, 0, PageRank, PageRank_1, HMAX, 965
STAT, 0, PageRank, NumWorkItems_1, HSUM, 28967334
STAT, 0, PageRank, NumIterations_1, HOST_0, 55
STAT, 0, PageRank, Timer_1, HMAX, 985
STAT, 0, PageRank, PageRank_2, HMAX, 964
STAT, 0, PageRank, NumWorkItems_2, HSUM, 28967334
STAT, 0, PageRank, NumIterations_2, HOST_0, 55
STAT, 0, PageRank, Timer_2, HMAX, 985
STAT, 0, PageRank, PageRank_3, HMAX, 965
STAT, 0, PageRank, NumWorkItems_3, HSUM, 28967334
STAT, 0, PageRank, NumIterations_3, HOST_0, 55
STAT, 0, PageRank, Timer_3, HMAX, 985
STAT, 0, PageRank, PageRank_4, HMAX, 964
STAT, 0, PageRank, NumWorkItems_4, HSUM, 28967334
STAT, 0, PageRank, NumIterations_4, HOST_0, 55
STAT, 0, PageRank, Timer_4, HMAX, 985
STAT, 0, PageRank, PageRank_5, HMAX, 963
STAT, 0, PageRank, NumWorkItems_5, HSUM, 28967334
STAT, 0, PageRank, NumIterations_5, HOST_0, 55
STAT, 0, PageRank, Timer_5, HMAX, 985
STAT, 0, PageRank, PageRank_6, HMAX, 965
STAT, 0, PageRank, NumWorkItems_6, HSUM, 28967334
STAT, 0, PageRank, NumIterations_6, HOST_0, 55
STAT, 0, PageRank, Timer_6, HMAX, 985
STAT, 0, PageRank, PageRank_7, HMAX, 964
STAT, 0, PageRank, NumWorkItems_7, HSUM, 28967334
STAT, 0, PageRank, NumIterations_7, HOST_0, 55
STAT, 0, PageRank, Timer_7, HMAX, 985
STAT, 0, PageRank, PageRank_8, HMAX, 965
STAT, 0, PageRank, NumWorkItems_8, HSUM, 28967334
STAT, 0, PageRank, NumIterations_8, HOST_0, 55
STAT, 0, PageRank, Timer_8, HMAX, 985
STAT, 0, PageRank, PageRank_9, HMAX, 965
STAT, 0, PageRank, NumWorkItems_9, HSUM, 28967334
STAT, 0, PageRank, NumIterations_9, HOST_0, 55
STAT, 0, PageRank, Timer_9, HMAX, 985
STAT, 0, PageRank, TimerTotal, HMAX, 13182
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_1, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_2, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_3, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_4, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_5, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_6, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_7, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_8, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_9, HSUM, 0
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1
PARAM, 0, DistBench, CommandLine, HOST_0, /ccs/home/yuxinc/Galois/build/gcc-11.1-nvcc-11.4/lonestar/analytics/distributed/pagerank/pagerank-push-dist /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr --partition=oec --pset=g --runs=10 --exec=Async --tolerance=0.01 --graphTranspose=/gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.tgr
PARAM, 0, DistBench, Threads, HOST_0, 1
PARAM, 0, DistBench, Hosts, HOST_0, 2
PARAM, 0, DistBench, Runs, HOST_0, 10
PARAM, 0, DistBench, Run_UUID, HOST_0, d9dd8ff2-ad8f-4a5b-b6b1-1810266f41e9
PARAM, 0, DistBench, Input, HOST_0, /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, g35n12
PARAM, 0, PageRank, Max Iterations, HOST_0, 1000
PARAM, 0, PageRank, Tolerance, HOST_0, 0.01
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

I think the time is in these entries: STAT, 0, PageRank, Timer_X, HMAX. Averaging the time over the 10 runs gives 987.8 ms.
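
As a quick cross-check, that average can be computed directly from the stats file with a small awk one-liner over the per-run Timer HMAX rows (a sketch; the stats file name is a placeholder):

# average the HMAX value (6th comma-separated field) of the per-run PageRank timers
awk -F', *' '/PageRank, Timer_[0-9]+, HMAX/ { sum += $6; n++ } END { print sum / n " ms" }' pr-stats.output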

Here is the single GPU run:

STAT_TYPE, HOST_ID, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, 0, dGraph_Generic, EdgeLoading, HMAX, 820
STAT, 0, dGraph_Generic, CuSPStateRounds, HOST_0, 100
STAT, 0, dGraph_Generic, EdgeInspection, HMAX, 812
STAT, 0, dGraph_Generic, GraphReading, HMAX, 199
STAT, 0, DistBench, GraphConstructTime, HMAX, 2580
STAT, 0, DistBench, TIMER_GRAPH_MARSHAL, HMAX, 1229
STAT, 0, PageRank, PageRank_0, HMAX, 1083
STAT, 0, PageRank, NumWorkItems_0, HSUM, 28967334
STAT, 0, PageRank, NumIterations_0, HOST_0, 55
STAT, 0, PageRank, Timer_0, HMAX, 1137
STAT, 0, PageRank, PageRank_1, HMAX, 1032
STAT, 0, PageRank, NumWorkItems_1, HSUM, 28967334
STAT, 0, PageRank, NumIterations_1, HOST_0, 55
STAT, 0, PageRank, Timer_1, HMAX, 1051
STAT, 0, PageRank, PageRank_2, HMAX, 1044
STAT, 0, PageRank, NumWorkItems_2, HSUM, 28967334
STAT, 0, PageRank, NumIterations_2, HOST_0, 55
STAT, 0, PageRank, Timer_2, HMAX, 1087
STAT, 0, PageRank, PageRank_3, HMAX, 1000
STAT, 0, PageRank, NumWorkItems_3, HSUM, 28967334
STAT, 0, PageRank, NumIterations_3, HOST_0, 55
STAT, 0, PageRank, Timer_3, HMAX, 1039
STAT, 0, PageRank, PageRank_4, HMAX, 1037
STAT, 0, PageRank, NumWorkItems_4, HSUM, 28967334
STAT, 0, PageRank, NumIterations_4, HOST_0, 55
STAT, 0, PageRank, Timer_4, HMAX, 1057
STAT, 0, PageRank, PageRank_5, HMAX, 1027
STAT, 0, PageRank, NumWorkItems_5, HSUM, 28967334
STAT, 0, PageRank, NumIterations_5, HOST_0, 55
STAT, 0, PageRank, Timer_5, HMAX, 1046
STAT, 0, PageRank, PageRank_6, HMAX, 1012
STAT, 0, PageRank, NumWorkItems_6, HSUM, 28967334
STAT, 0, PageRank, NumIterations_6, HOST_0, 55
STAT, 0, PageRank, Timer_6, HMAX, 1032
STAT, 0, PageRank, PageRank_7, HMAX, 1035
STAT, 0, PageRank, NumWorkItems_7, HSUM, 28967334
STAT, 0, PageRank, NumIterations_7, HOST_0, 55
STAT, 0, PageRank, Timer_7, HMAX, 1091
STAT, 0, PageRank, PageRank_8, HMAX, 1032
STAT, 0, PageRank, NumWorkItems_8, HSUM, 28967334
STAT, 0, PageRank, NumIterations_8, HOST_0, 55
STAT, 0, PageRank, Timer_8, HMAX, 1051
STAT, 0, PageRank, PageRank_9, HMAX, 1052
STAT, 0, PageRank, NumWorkItems_9, HSUM, 28967334
STAT, 0, PageRank, NumIterations_9, HOST_0, 55
STAT, 0, PageRank, Timer_9, HMAX, 1072
STAT, 0, PageRank, TimerTotal, HMAX, 14484
STAT, 0, Gluon, ReduceNumMessages_PageRank_0, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_1, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_2, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_3, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_4, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_5, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_6, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_7, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_8, HSUM, 0
STAT, 0, Gluon, ReduceNumMessages_PageRank_9, HSUM, 0
STAT, 0, Gluon, ReplicationFactor, HOST_0, 1
PARAM, 0, DistBench, CommandLine, HOST_0, /ccs/home/yuxinc/Galois/build/gcc-11.1-nvcc-11.4/lonestar/analytics/distributed/pagerank/pagerank-push-dist /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr --pset=g --runs=10 --tolerance=0.01 --graphTranspose=/gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.tgr
PARAM, 0, DistBench, Threads, HOST_0, 1
PARAM, 0, DistBench, Hosts, HOST_0, 1
PARAM, 0, DistBench, Runs, HOST_0, 10
PARAM, 0, DistBench, Run_UUID, HOST_0, 1fa420a5-fd01-4ca3-919a-2ceca2c72b0b
PARAM, 0, DistBench, Input, HOST_0, /gpfs/alpine/bif115/scratch/yuxinc/graph_datasets/soc-LiveJournal1/soc-LiveJournal1.gr
PARAM, 0, DistBench, PartitionScheme, HOST_0, oec
PARAM, 0, DistBench, Hostname, HOST_0, b17n13
PARAM, 0, PageRank, Max Iterations, HOST_0, 1000
PARAM, 0, PageRank, Tolerance, HOST_0, 0.01
PARAM, 0, dGraph, GenericPartitioner, HOST_0, 0

The average time is 1066.3 ms.
The strong scaling number is 0.92.
This was run on V100 GPUs.

@YuxinxinChen
Author

I ran twitter40: on a single GPU the runtime is 11238.8 ms, and on 2 GPUs it is 7248.0 ms with the oec partition. This scaling makes more sense to me. Do you get similar performance?
