
Update to changed rmm::device_scalar API #1637

Merged: 5 commits into rapidsai:branch-21.08 on Jun 14, 2021

Conversation

harrism
Member

@harrism commented Jun 1, 2021

rapidsai/rmm#789 is a breaking API change for rmm::device_scalar. This PR fixes a couple of uses of rmm::device_scalar so that cuGraph continues to build, and should be merged immediately after rapidsai/rmm#789.

Also fixes an unrelated narrowing conversion warning.
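
A minimal sketch of the kind of call-site update described above (illustrative only; `read_counter` is a hypothetical helper, not code from this PR), assuming the post-change `rmm::device_scalar` API takes an explicit stream for its stream-ordered methods:

```cpp
// Illustrative sketch only: shows the call-site style described above, where
// device_scalar's stream-ordered methods take an explicit stream argument,
// matching device_uvector and device_buffer. read_counter() is a hypothetical
// helper, not code from this PR.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_scalar.hpp>

int read_counter(rmm::cuda_stream_view stream)
{
  // Construction is stream-ordered and takes an explicit stream.
  rmm::device_scalar<int> counter{0, stream};

  // Copying the value back to the host also requires an explicit stream;
  // callers can no longer rely on a default stream argument.
  return counter.value(stream);
}
```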

@BradReesWork added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jun 1, 2021
@BradReesWork added this to the 21.06 milestone Jun 1, 2021
@BradReesWork
Member

rerun tests

@BradReesWork modified the milestones: 21.06 → 21.08 Jun 1, 2021
@seunghwak mentioned this pull request Jun 1, 2021
@harrism
Member Author

harrism commented Jun 2, 2021

CI won't pass until rapidsai/rmm#789 is merged.

@BradReesWork
Member

rerun tests

rapids-bot bot pushed a commit to rapidsai/rmm that referenced this pull request Jun 8, 2021
This PR refactors `device_scalar` to use a single-element `device_uvector` for its storage. This simplifies the implementation of `device_scalar`. It also changes the API of `device_scalar` so that its asynchronous / stream-ordered methods use the same API style (with an explicit stream parameter) as `device_uvector` and `device_buffer`.

Closes #570

This is a breaking change. When it is merged, follow-on PRs in other libraries will likely need to be merged immediately to account for the API changes.

 - [x] cuDF: rapidsai/cudf#8411
 - [x] cuGraph: rapidsai/cugraph#1637
 - [x] RAFT: rapidsai/raft#259  
 - [x] ~cuML~ (unused)
 - [x] ~cuSpatial~ (unused)

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #789
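
The storage refactor described in this commit message can be pictured roughly as follows. This is a minimal sketch under the description above, not RMM's actual `device_scalar` implementation:

```cpp
// Minimal sketch (not RMM's actual implementation): a scalar backed by a
// single-element device_uvector, with every access stream-ordered and taking
// an explicit stream, mirroring device_uvector and device_buffer.
// (Assumes T is trivially copyable, as device_uvector requires.)
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

template <typename T>
class scalar_sketch {
 public:
  explicit scalar_sketch(rmm::cuda_stream_view stream) : storage_{1, stream} {}

  // Set the value; the operation is ordered on the given stream.
  void set_value(T const& v, rmm::cuda_stream_view stream)
  {
    storage_.set_element(0, v, stream);
  }

  // Copy the value back to the host, synchronizing on the given stream.
  T value(rmm::cuda_stream_view stream) const { return storage_.element(0, stream); }

  // Raw device pointer to the scalar.
  T* data() noexcept { return storage_.data(); }

 private:
  rmm::device_uvector<T> storage_;  // one element of device memory
};
```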
@harrism
Member Author

harrism commented Jun 8, 2021

rerun tests

@harrism
Member Author

harrism commented Jun 8, 2021

rerun tests

@harrism
Member Author

harrism commented Jun 8, 2021

This PR also depends on rapidsai/raft#259. Once that is merged, hopefully this one will pass CI.

@ChuckHastings
Collaborator

rerun tests

1 similar comment
@ChuckHastings
Collaborator

rerun tests

@harrism
Member Author

harrism commented Jun 8, 2021

I suggest issuing the @gpucibot merge command now so it auto-merges once CI passes...

@ajschmidt8
Member

@gpucibot merge

@codecov-commenter

codecov-commenter commented Jun 8, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@6aab21f).
The diff coverage is n/a.

❗ Current head 155c27f differs from pull request most recent head fa9b367. Consider uploading reports for the commit fa9b367 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #1637   +/-   ##
===============================================
  Coverage                ?   59.96%           
===============================================
  Files                   ?       80           
  Lines                   ?     3542           
  Branches                ?        0           
===============================================
  Hits                    ?     2124           
  Misses                  ?     1418           
  Partials                ?        0           

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6aab21f...fa9b367.

@harrism
Member Author

harrism commented Jun 9, 2021

I spoke too soon. This doesn't look related to this PR though:

```
09:54:54     from dask.dataframe.core import (
09:54:54 E   ImportError: cannot import name 'make_meta' from 'dask.dataframe.core' (/opt/conda/envs/rapids/lib/python3.8/site-packages/dask/dataframe/core.py)
```

cpp/src/traversal/tsp.cu: review comment (outdated, resolved)
@ajschmidt8
Member

I think the failures here could be related to this PR: rapidsai/cudf#8458

Co-authored-by: Paul Taylor <paul.e.taylor@me.com>
@ajschmidt8
Member

rerun tests

@ChuckHastings
Collaborator

I think we might need to update the cugraph conda dependencies as well.

The linked cudf update corrects the cudf dependencies, but our environment specifically downloads older versions.

@ajschmidt8
Member

I found an issue with the rapids-build-env conda package that could be contributing to the failures here. I kicked off the Jenkins run below which should publish some corrected packages. Let's hold off on rerunning the tests until that job is complete.

https://gpuci.gpuopenanalytics.com/job/rapidsai/job/integration/job/branches/job/integration-branch-pipeline/567/

@ajschmidt8
Member

rerun tests 🤞

@pentschev
Member

I don't know whether the updated rapids-build-env package was simply not available yet, but the wrong one was picked again in the last build:

05:00:30   - rapidsai-nightly/linux-64::rapids-build-env==21.08.00a210608=cuda11.0_py37_g58360b3_408

It should have been build 425 rather than 408, but rapids-notebook-env picked the correct build 425. Rerunning tests to check.

@pentschev
Member

rerun tests

@pentschev
Member

Even on the latest run (which hasn't completed yet), the incorrect version of rapids-build-env is being picked:

15:15:07   - rapidsai-nightly/linux-64::rapids-build-env==21.08.00a210608=cuda11.0_py37_g58360b3_408

Just as in my previous comment, it should get build 425, which depends on the correct versions of Dask/Distributed.

@ajschmidt8 any ideas what could still be going on here?

@jakirkham
Member

Wonder if we need to rebuild Docker images as well

@ajschmidt8
Member

It looks like the correct version of dask is used all the way up until this point where libcugraph is installed. Installing libcugraph downgrades dask back to 2021.5.1 for some reason.

I think the best way to fix this would be to follow in cuml's footsteps here and install the latest version of dask after libcugraph has been installed.

@pentschev
Member

It looks like the correct version of dask is used all the way up until this point where libcugraph is installed. Installing libcugraph downgrades dask back to 2021.5.1 for some reason.

I think the best way to fix this would be to follow in cuml's footsteps here and install the latest version of dask after libcugraph has been installed.

This may fix the issue now, but this doesn't explain why an older rapids-build-env is being picked, which seems to be the ultimate reason why Dask is getting downgraded. The workaround being suggested can blow up again in the future if another conda install is included after the point at which Dask is installed from source.

@jakirkham
Member

Would also add cugraph already seems to be installing dask + distributed from main. Had checked this previously when looking at Peter's Dask update PR ( #1666 )

@ajschmidt8
Member

It looks like the correct version of dask is used all the way up until this point where libcugraph is installed. Installing libcugraph downgrades dask back to 2021.5.1 for some reason.
I think the best way to fix this would be to follow in cuml's footsteps here and install the latest version of dask after libcugraph has been installed.

This may fix the issue now, but this doesn't explain why an older rapids-build-env is being picked, which seems to be the ultimate reason why Dask is getting downgraded. The workaround being suggested can blow up again in the future if another conda install is included after the point at which Dask is installed from source.

You are correct. We have an internal issue open to investigate a larger issue that is likely contributing to this problem: the wrong gpuci/rapidsai Docker image is being used for CI here. If you scroll to the top of the failed build logs, you will see it says gpuci/rapidsai:21.06-cuda11.2-devel-centos7-py3.8 instead of gpuci/rapidsai:21.08-cuda11.2-devel-centos7-py3.8. One of the differences between those two images is the dask version, which is likely contributing to the problem here. For now, the workaround I described above should suffice until we correct the images used in our CI system during release/code-freeze periods like the one we're in right now.

@ajschmidt8
Member

Would also add cugraph already seems to be installing dask + distributed from main. Had checked this previously when looking at Peter's Dask update PR ( #1666 )

I saw that. I think the order of operations is what we'd need to change. We'd want to move the pip install to happen after the conda install of libcugraph similar to what cuml does.

@pentschev
Member

pentschev commented Jun 11, 2021

Note that per the logs, we will need to install Dask twice, where it is now and after libcugraph. The first install upgrades from 2021.5.1, which is downgraded during libcugraph install and will need to be upgraded again.

EDIT: Never mind, I think installing Dask only after libcugraph should suffice. I thought there were some tests running before libcugraph was installed.

@pentschev
Member

The build is already past the failure point, hopefully this will be green soon!

@BradReesWork
Member

@gpucibot merge

@pentschev
Member

Just like in rapidsai/cuml#3978 (comment), the failures appear to be genuinely caused by the changes introduced in Distributed 2021.6.0, for example:

TypeError: No dispatch for <class 'cudf.core.index.RangeIndex'>
x = <class 'int'>, index = RangeIndex(start=0, stop=0, step=1)
parent_meta = Series([], Name: source, dtype: int32)

    def make_meta(x, index=None, parent_meta=None):
        """
        This method creates meta-data based on the type of ``x``,
        and ``parent_meta`` if supplied.
    
        Parameters
        ----------
        x : Object of any type.
            Object to construct meta-data from.
        index :  Index, optional
            Any index to use in the metadata. This is a pass-through
            parameter to dispatches registered.
        parent_meta : Object, default None
            If ``x`` is of arbitrary types and thus Dask cannot determine
            which back-end to be used to generate the meta-data for this
            object type, in which case ``parent_meta`` will be used to
            determine which back-end to select and dispatch to. To use
            utilize this parameter ``make_meta_obj`` has be dispatched.
            If ``parent_meta`` is ``None``, a pandas DataFrame is used for
            ``parent_meta`` thats chooses pandas as the backend.
    
        Returns
        -------
        A valid meta-data
        """
    
        if isinstance(
            x,
            (
                dd._Frame,
                dd.core.Scalar,
                dd.groupby._GroupBy,
                dd.accessor.Accessor,
                da.Array,
            ),
        ):
            return x._meta
    
        try:
>           return make_meta_dispatch(x, index=index)

I'm guessing cuGraph will require an update to adapt to the changes above, similar to what has been done in rapidsai/cudf#8426. Also cc'ing @galipremsagar in case he has thoughts or could work on addressing those.

@galipremsagar
Contributor

rerun tests

3 similar comments
@jakirkham
Member

rerun tests

@jakirkham
Member

rerun tests

@BradReesWork
Member

rerun tests

@ajschmidt8
Member

rerun tests

The previous error log link & screenshot are below. More than likely, we'll need to update the Docker version on the GCP node here, but I'm retrying the tests one more time to see if it happened to be a transient error.
(log link)
[screenshot of the error]

@ajschmidt8
Member

As it turns out, A100 testing shouldn't have been enabled for this repo. It's been disabled. Since that was the only failure in the previous run, I will admin-merge this PR since it's holding things up.

@rapids-bot bot merged commit 6b5079c into rapidsai:branch-21.08 Jun 14, 2021
@jakirkham
Member

Thanks AJ! 😀
