Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Critical: Force cudf.concat when passing in a cudf Series to MG Uniform Neighbor Sample #3416

Merged
merged 12 commits into from
Apr 5, 2023

Conversation

alexbarghi-nv
Copy link
Member

Currently, cudf does not merge series properly when they already share an index. I'm not sure if this is a bug in cudf, or intentional behavior. This issue does not occur with dask_cudf. The resolution is to use cudf.concat when passing a cudf.Series for start vertices and batch ids, and df.to_frame().merge when passing in a dask_cudf.Series for start vertices and batch ids.

This PR also adds an additional test which tests both cudf and dask_cudf inputs to catch these sort of problems in the future.

@alexbarghi-nv alexbarghi-nv changed the title Force cudf.concat when passing in a cudf Series to MG Uniform Neighbor Sample [BUG] Critical: Force cudf.concat when passing in a cudf Series to MG Uniform Neighbor Sample Apr 4, 2023
@alexbarghi-nv alexbarghi-nv self-assigned this Apr 4, 2023
@alexbarghi-nv alexbarghi-nv added bug Something isn't working non-breaking Non-breaking change labels Apr 4, 2023
@alexbarghi-nv alexbarghi-nv marked this pull request as ready for review April 4, 2023 18:58
@alexbarghi-nv alexbarghi-nv requested a review from a team as a code owner April 4, 2023 18:58
@BradReesWork BradReesWork added this to the 23.04 milestone Apr 4, 2023
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a test. I have one minor change request related to the docstring, otherwise it LGTM.

Comment on lines +404 to +422
start_list = start_list.to_frame()
batch_id_list = batch_id_list.to_frame()
ddf = start_list.merge(
batch_id_list,
how="left",
left_index=True,
right_index=True,
)
else:
# sg input
ddf = cudf.concat(
[
start_list,
batch_id_list,
],
axis=1,
)
else:
ddf = start_list
ddf = start_list.to_frame()
Copy link
Member

@VibhuJawa VibhuJawa Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really care about the index here ? I think not . Does below work ?

start_list = start_list.reset_index(drop=True)
batch_id_list = batch_id_list.reset_index(drop=True)

if isinstance(start_list, dask_cudf.Series):
   ddf = dd.concat([start_list, batch_id_list], ignore_unknown_divisions=True, axis=1) 
else:
   ddf = cudf.concat([start_list, batch_id_list], axis =1, ignore_index=True) 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we reset index can we join batch id and start list correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also, I ran into an issue with dask_cudf.concat where the name of the series was dropped in one of my first attempts at a solution. dask_cudf.merge doesn't have that problem.

Copy link
Member

@VibhuJawa VibhuJawa Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to, from the logic you shared , we are merging on index ( left_index=True, right_index=True) in dask which is the same thing but more inefficient.

Edit: Also added ingore_index=True to make it more concrete in cuDF.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let me try this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VibhuJawa I just confirmed this is not an issue with dask-cudf, it's an issue with our get_distributed_data function. I will make an issue for cugraph instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why calling merge instead of concat before get_distributed_data works, but for some reason the bug completely disappears with merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take a look too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating an issue .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should link it here, sorry: #3420

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alexbarghi-nv
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit e76406d into rapidsai:branch-23.04 Apr 5, 2023
@alexbarghi-nv alexbarghi-nv deleted the sampling-fix-concat branch April 5, 2023 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants