Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating tables with a dask-cudf DataFrame #219

Closed

Conversation

sarahyurick
Copy link
Collaborator

Dask-SQL can create tables directly from storage in SQL syntax:

CREATE TABLE my_data WITH (
    format = 'csv',
    location = './iris.csv'
)

As suggested by @randerzander it would be nice to have a way to create this backed by a dask-cudf DF instead of a Dask CPU DF. This is my first step toward achieving that.

So far, I just added a boolean parameter gpu, although perhaps something like HOW='gpu' vs default HOW='cpu' would be better. These changes are also only for reading in a single file. Mainly want to get your thoughts about how to support this @nils-braun :)

For example, something like this works with this PR:

CREATE OR REPLACE TABLE my_data WITH (
    format = 'csv',
    location = './iris.csv',
    gpu=True
)

@sarahyurick
Copy link
Collaborator Author

Looks like the tests are failing from everywhere else the new parameter would need to be added...

Copy link
Collaborator

@nils-braun nils-braun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sarahyurick !
I see two possibilities to work around the additional-keyword-argument-problem:

  • Either you add into each input plugin which does not support gpu a first line with throws an exception when gpu is set to true
  • or you do not specifically use a gpu argument in the main functions and let only the locations input plugin handle it. Additional arguments are passed via kwargs, so you could just extract it from there (for example the intake plugin has a intake_table_name, which is only used for this class)

dask_sql/physical/rel/custom/create_table.py Outdated Show resolved Hide resolved
@sarahyurick
Copy link
Collaborator Author

sarahyurick commented Aug 19, 2021

Cool, thanks for your input! I also shared your comment with @ayushdg who opened #220.

@sarahyurick sarahyurick reopened this Aug 19, 2021
@sarahyurick
Copy link
Collaborator Author

Okay, it looks like there were only 2 other places in the code (in pandas.py and intake.py) where it was expecting the extra parameter, so I updated those. What do you think? I also like your idea to let only the locations input plugin handle it, so I could definitely do that instead, especially if you think there are any places I am potentially overlooking.

@codecov-commenter
Copy link

codecov-commenter commented Aug 20, 2021

Codecov Report

Merging #219 (fd41b05) into main (0c3a6a1) will decrease coverage by 0.52%.
The diff coverage is 70.96%.

Impacted file tree graph

@@             Coverage Diff             @@
##              main     #219      +/-   ##
===========================================
- Coverage   100.00%   99.48%   -0.52%     
===========================================
  Files           64       64              
  Lines         2470     2500      +30     
  Branches       340      350      +10     
===========================================
+ Hits          2470     2487      +17     
- Misses           0        8       +8     
- Partials         0        5       +5     
Impacted Files Coverage Δ
dask_sql/context.py 100.00% <ø> (ø)
dask_sql/input_utils/convert.py 100.00% <ø> (ø)
dask_sql/input_utils/hive.py 100.00% <ø> (ø)
dask_sql/physical/rel/logical/filter.py 88.88% <ø> (-11.12%) ⬇️
dask_sql/input_utils/pandas.py 69.23% <33.33%> (-30.77%) ⬇️
dask_sql/input_utils/location.py 86.95% <40.00%> (-13.05%) ⬇️
dask_sql/input_utils/intake.py 86.66% <50.00%> (-13.34%) ⬇️
dask_sql/physical/rel/custom/create_experiment.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/custom/create_model.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/custom/create_table.py 100.00% <100.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f7ebc1...fd41b05. Read the comment docs.

@nils-braun
Copy link
Collaborator

Sorry @sarahyurick - I just realised that the coverage was not shown correctly (because we moved to the new git space) and that the hive test was disabled.
I will fix both in #221 - you might need to update to the newest main version once merged. After that, we can check together if you maybe need some additional tests or any pragma: no cover annotations.

@sarahyurick
Copy link
Collaborator Author

Sounds good! There were some conflicts that required me to re-open this PR as #224.

@nils-braun
Copy link
Collaborator

Ok, thanks! Just as a note (but you might already know this): you do not need to create a new PR if there are conflicts - you can just rebase and force-push or (preferred) merge the new main branch into your branch, solve the merge conflicts and push again. But also opening a new PR is fine :-)

@sarahyurick sarahyurick deleted the create_table_with_gpu branch September 21, 2022 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants