
[BUG] dask.uniform_neighbor_sample gives different results depending on the number of GPUs used #2761

Closed
VibhuJawa opened this issue Sep 29, 2022 · 3 comments

VibhuJawa (Member) commented Sep 29, 2022

Describe the bug

dask.uniform_neighbor_sample gives different results depending on the number of GPUs used. The single-GPU results are correct, while the multi-GPU results appear to be incorrect.

See the MRE below.

Observed output. Only the workers = 1 rows are correct: the single seed vertex (6) has 128 out-neighbors, so with with_replacement=False each call should return exactly fanout samples:

Number of workers = 1, Fanout 1, Number of Samples: 1
Number of workers = 1, Fanout 2, Number of Samples: 2
Number of workers = 1, Fanout 3, Number of Samples: 3
Number of workers = 2, Fanout 1, Number of Samples: 2
Number of workers = 2, Fanout 2, Number of Samples: 4
Number of workers = 2, Fanout 3, Number of Samples: 5
Number of workers = 3, Fanout 1, Number of Samples: 3
Number of workers = 3, Fanout 2, Number of Samples: 5
Number of workers = 3, Fanout 3, Number of Samples: 7
Number of workers = 4, Fanout 1, Number of Samples: 4
Number of workers = 4, Fanout 2, Number of Samples: 7
Number of workers = 4, Fanout 3, Number of Samples: 10

Steps/Code to reproduce bug

MRE:

def run_multigraph_sampling(client):
    import cudf
    import dask_cudf
    import numpy as np

    import cugraph
    import cugraph.dask.comms.comms as Comms

    Comms.initialize(p2p=True)
    n_workers = len(client.scheduler_info()['workers'])

    # Star graph: vertex 6 has 128 out-edges (0..127)
    df = cudf.DataFrame({'src': [6] * 128,
                         'dst': np.arange(0, 128)})
    df = df.astype('int32')

    dask_df = dask_cudf.from_cudf(df, npartitions=128)

    mg_G = cugraph.MultiGraph(directed=True)
    mg_G.from_dask_cudf_edgelist(dask_df,
                                 source='src',
                                 destination='dst',
                                 renumber=True,
                                 legacy_renum_only=True)

    # Sample from the single seed vertex 6; the sample size should equal the
    # fanout, independent of the number of workers.
    for fanout in [1, 2, 3]:
        output_df = cugraph.dask.uniform_neighbor_sample(
            mg_G,
            cudf.Series([6]).astype('int32'),
            fanout_vals=[fanout],
            with_replacement=False,
        )
        print(f"Number of workers = {n_workers}, Fanout {fanout}, Number of Samples: {len(output_df)}")

    Comms.destroy()

Cluster Setup Code

from dask_cuda import LocalCUDACluster
from dask.distributed import Client


for CUDA_VISIBLE_DEVICES in ['4', '4,5', '4,5,6', '4,5,6,7']:
    with LocalCUDACluster(protocol='tcp',
                          local_directory='/raid/vjawa/dask-cuda-dir/',
                          CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES
                          ) as cluster, Client(cluster) as client:
        run_multigraph_sampling(client)
        client.shutdown()
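
For comparison, the single-GPU (non-dask) sampler can be run on the same edge list to establish the expected counts. This is a minimal sketch, assuming the single-GPU cugraph.uniform_neighbor_sample API accepts the same fanout_vals / with_replacement arguments as the dask version used in the MRE:

import cudf
import numpy as np
import cugraph

# Same star graph as the MRE: vertex 6 has 128 out-edges (0..127)
df = cudf.DataFrame({'src': [6] * 128,
                     'dst': np.arange(0, 128)}).astype('int32')

G = cugraph.MultiGraph(directed=True)
G.from_cudf_edgelist(df, source='src', destination='dst', renumber=True)

for fanout in [1, 2, 3]:
    # Assumption: the single-GPU API mirrors the dask API's argument names
    output_df = cugraph.uniform_neighbor_sample(
        G,
        cudf.Series([6]).astype('int32'),
        fanout_vals=[fanout],
        with_replacement=False,
    )
    print(f"Single GPU, Fanout {fanout}, Number of Samples: {len(output_df)}")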

Expected behavior

I would expect correct results independent of the number of GPUs being used.
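
Concretely, the expectation can be written as a check inside the fanout loop of the MRE. A minimal sketch, assuming (as the MRE's print already does) that the length of the returned frame equals the number of sampled edges:

    # Seed vertex 6 has out-degree 128, so for fanout <= 128 and
    # with_replacement=False exactly `fanout` edges should come back,
    # regardless of how many workers the cluster has.
    expected = fanout
    assert len(output_df) == expected, (
        f"expected {expected} samples, got {len(output_df)}"
    )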

Environment details

# packages in environment at /datasets/vjawa/miniconda3/envs/cugraph_dev_sep_29:
cugraph                   22.10.0a0+88.g91598080          pypi_0    pypi
libcugraphops             22.10.00a220929 cuda11_g553bacf_29    rapidsai-nightly
pylibcugraph              22.10.0a0+88.g91598080          pypi_0    pypi

Additional context

Probably related Issue: #2760

Probably related PR that can fix this: #2751

CC: @ChuckHastings
CC: @alexbarghi-nv , @rlratzel

VibhuJawa added the "bug (Something isn't working)" and "? - Needs Triage (Need team to review and classify)" labels on Sep 29, 2022
ChuckHastings (Collaborator) commented:
Will recheck when PR 2751 is ready. Thanks for reporting so I can check this condition as well.

BradReesWork removed the "? - Needs Triage (Need team to review and classify)" label on Sep 30, 2022
BradReesWork added this to the 22.10 milestone on Sep 30, 2022
rapids-bot pushed a commit that referenced this issue on Sep 30, 2022:
This PR fixes rapidsai/graph_dl#27, rapidsai/graph_dl#43, and rapidsai/graph_dl#39.


**Tests Added:**

Single GPU:
- [x] APIs like num_nodes, num_edges
- [x] test_sampling_basic
- [x] test_sampling_homogeneous_gs_in_dir
- [x] test_sampling_homogeneous_gs_out_dir
- [x] test_sampling_gs_homogeneous_neg_one_fanout
- [x] test_sampling_gs_heterogeneous_in_dir
- [x] test_sampling_gs_heterogeneous_out_dir
- [x] test_sampling_gs_heterogeneous_neg_one_fanout

Multi GPU:
- [x] APIs like num_nodes, num_edges
- [x] test_sampling_basic
- [x] test_sampling_homogeneous_gs_in_dir
- [x] test_sampling_homogeneous_gs_out_dir
- [x] test_sampling_gs_homogeneous_neg_one_fanout
- [x] test_sampling_gs_heterogeneous_in_dir
- [x] test_sampling_gs_heterogeneous_out_dir
- [x] test_sampling_gs_heterogeneous_neg_one_fanout

Bugs to reproduce:
- [ ] Repro heterogeneous single-GPU hang outside pytest
- [x] Repro heterogeneous multi-GPU incorrect results for with_replacement=False (#2760)
- [x] Repro heterogeneous incorrect results for different numbers of GPUs (#2761)

Tests that depend upon #2523:
- [x] Add minimal example to PR to ensure it gets fixed (comment added here: #2523 (review))
- [x] test_get_node_storage_gs (failing because of a PG bug)
- [x] test_get_edge_storage_gs (failing because of a PG bug)

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)
  - Brad Rees (https://github.com/BradReesWork)

URL: #2592
ChuckHastings (Collaborator) commented:
Same analysis as #2760

rlratzel (Contributor) commented Oct 3, 2022

Closed by PR #2765.

rlratzel closed this as completed on Oct 3, 2022