Reading tables with a dask-cudf DataFrame #224
Codecov Report
@@            Coverage Diff            @@
##              main      #224   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           64        64
  Lines         2589      2590    +1
  Branches       362       361    -1
=========================================
+ Hits          2589      2590    +1
Thanks @sarahyurick!
I have only two comments before we can merge:
- There are two additional input methods, hive.py and dask.py. The latter is trivial (I guess a Dask-cudf data frame is also a Dask data frame, so we can just keep the logic), but you should also add a check like the one in the intake plugin to disallow GPUs in hive.py (or we rewrite it to allow GPUs, but maybe that is something for the next step). I am actually wondering why the tests did not fail for hive...
- Can you make sure the coverage is 100% again? On the pandas-like PR I already asked how we can best test the CPU behaviour via GitHub Actions. To begin with, I think we need "# pragma: no cover" comments in all GPU-only places. I would like to keep the 100% coverage if possible (even if this means we will need some coverage exceptions).
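As a rough illustration of the kind of guard requested for hive.py, here is a minimal sketch. The function name reject_gpu_input, its parameters, and the error message are hypothetical and not taken from dask-sql; the actual check in the intake plugin may look different.

```python
def reject_gpu_input(gpu: bool, source: str) -> None:
    """Hypothetical guard: fail early when a CPU-only input plugin gets gpu=True.

    `source` is only used to build a readable error message.
    """
    if gpu:
        raise NotImplementedError(f"{source} input does not support gpu=True yet")


# Assumed usage at the top of a CPU-only input plugin:
reject_gpu_input(gpu=False, source="Hive")  # passes silently on the CPU path
```

Because this branch only raises an error and needs no GPU, it could be exercised in a normal CPU test run; branches that actually import cudf or dask_cudf are the ones that would need the coverage exception.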
Sounds good - I've updated
dask_sql/input_utils/pandaslike.py
Outdated
if gpu:  # pragma: no cover
    import dask_cudf

    return dask_cudf.from_cudf(
        cudf.from_pandas(input_item), npartitions=npartitions, **kwargs,
    )
else:
    return dd.from_pandas(input_item, npartitions=npartitions, **kwargs)
Given that this input util accepts both cudf and pandas dataframes as valid inputs, you'd probably need an additional check here to see whether input_item is a pandas dataframe, and call the from_pandas function only in that case.
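One way the suggested check could look is sketched below. The helper names is_pandas_frame and to_dask_frame are hypothetical, and the imports are done lazily inside the functions only so that the sketch loads on machines without cudf, dask_cudf, or dask installed; the real pandaslike.py may be organised differently.

```python
def is_pandas_frame(obj) -> bool:
    """Return True only for pandas DataFrames (False if pandas is not installed)."""
    try:
        import pandas as pd
    except ImportError:
        return False
    return isinstance(obj, pd.DataFrame)


def to_dask_frame(input_item, npartitions=1, gpu=False, **kwargs):
    """Hypothetical helper: wrap a pandas or cudf frame in a Dask collection."""
    if gpu:  # pragma: no cover
        import cudf
        import dask_cudf

        # Convert only genuine pandas input; a cudf frame is passed through as-is.
        if is_pandas_frame(input_item):
            input_item = cudf.from_pandas(input_item)
        return dask_cudf.from_cudf(input_item, npartitions=npartitions, **kwargs)

    import dask.dataframe as dd

    return dd.from_pandas(input_item, npartitions=npartitions, **kwargs)
```

With a check like this, a cudf dataframe passed together with gpu=True would skip the cudf.from_pandas conversion, which is the case the comment above is concerned about.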
I like this! LGTM!
Updated version of #219. Also tagging @ayushdg: if you have time, could you double check the pandaslike.py changes specifically?