Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add execution_options to SQLQueryDataSet #1865

Conversation

clotildeguinard
Copy link
Contributor

@clotildeguinard clotildeguinard commented Sep 22, 2022

Signed-off-by: Clotilde Guinard clotilde_guinard@hotmail.fr

Description

Resolves #752

Development notes

Now a SQLQueryDataSet can be customized with execution options (see https://docs.sqlalchemy.org/en/latest/core/connections.html#sqlalchemy.engine.Connection.execution_options).

Internally, the SQLQueryDataSet class used to "cache" the different engines in order to reuse the same object when 2 datasets where sharing the same connection string. So we had to extend the behaviour of this cache in order to take the execution options into account in the cache key.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Resolves #752

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
@clotildeguinard clotildeguinard marked this pull request as draft September 22, 2022 11:03
@clotildeguinard
Copy link
Contributor Author

Sorry I realize a bit late that I may simplify my implementation by injecting the execution_options at load time. This would avoid to complexify the caching behaviour... I will submit an additional commit on top of those, and un-draft the PR after that.

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
@clotildeguinard clotildeguinard marked this pull request as ready for review September 22, 2022 12:05
@clotildeguinard
Copy link
Contributor Author

Ok, I have pushed the new implementation with updated tests.

Question for later: should I apply the same approach to the SQLTableDataSet? (which is quite close to the SQLQueryDataSet)

@noklam
Copy link
Contributor

noklam commented Sep 26, 2022

@MerelTheisenQB

I am slightly concerned about this change. SQLQueryDataSet is essentially a wrapper on pandas API, with this change we are adding new behaviors on top of pandas. It's a question why are we doing this and it starts to blur the API, is this still a pandas/SQLQueryDataSet then? Do we have a generic to handle connection instead of injecting parameters into each dataset? The original article about server-side cursor is certainly interesting and it's great to learn about it.

Note that at the moment pandas doesn't really support this natively. See this recently closed PR

@clotildeguinard
Copy link
Contributor Author

@noklam I will let you decide with @MerelTheisenQB about the relevance of the issue itself.

Just to answer about the following question that you asked: "Do we have a generic to handle connection instead of injecting parameters into each dataset?". Actually, this is what I did in my first implementation, which injected the execution options at engine level.

Then I added the following commit (the last one of the branch at the moment) cfa45af in order to modify this and inject params into each dataset, because I thought it was adding less complexity on the "caching" mechanism (hint: 16 line changes instead of 53). But we can easily go back to the previous implementation, this is just 1 commit to revert.

@merelcht
Copy link
Member

I'm not too concerned about this change adding extra functionality which doesn't exist in the pandas API. The way I see it is that we leverage existing APIs where we can, but also add in features if they benefit users. If a user wants to stick closer to the original pandas API they can always decide not to use this functionality.

I see @deepyaman was tagged in the original issue, do you have any thoughts on this?

@merelcht merelcht mentioned this pull request Nov 7, 2022
10 tasks
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution, this is something I don't know and happy to learn more about this batching option :) I drop some comments on the documentation, I am happy with the implementation otherwise.

Comment on lines 353 to 356
execution_options: A dictionary with non-SQL options for the connection to
be applied to the underlying engine. To find all supported execution
options, see here:
https://docs.sqlalchemy.org/en/12/core/connections.html#sqlalchemy.engine.Connection.execution_options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves more documentation as this is an advance feature and not documented in the pandas API.

i.e. What options are common and when do we use it?

The article discusses client-side & server-side computation, are there any tradeoffs here, or server-side is always preferred? I think we should at least mention that if users are running into memory issues, it is worths trying this option.
https://pythonspeed.com/articles/pandas-sql-chunking/

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some small comments, but I think when those are resolved this is good to be merged 🙂

@@ -346,6 +350,10 @@ def __init__( # pylint: disable=too-many-arguments
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
All defaults are preserved, except `mode`, which is set to `r` when loading.
filepath: A path to a file with a sql query statement.
execution_options: A dictionary with non-SQL options for the connection to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth mentioning that this functionality isn't supported by pandas so that users are aware in case they only want to use pandas features.

},
"execution_options": {
"type": "object",
"description": "A dictionary with non-SQL options for the connection\nto be applied to the underlying engine.\nTo find all supported execution options, see here:\nhttps://docs.sqlalchemy.org/en/12/core/connections.html#sqlalchemy.engine.Connection.execution_options"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also add a similar comment here about this being additional functionality that doesn't currently exist in pandas.

@noklam noklam self-assigned this Nov 15, 2022
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
@noklam
Copy link
Contributor

noklam commented Nov 15, 2022

added some docs and change the example, I think the correct option should be stream_results instead of streaming.

noklam and others added 2 commits November 15, 2022 14:47
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
@noklam noklam requested a review from merelcht November 15, 2022 15:34
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small comments from me, but otherwise this looks good to be merged 👍 🙂

@@ -286,7 +286,19 @@ class SQLQueryDataSet(AbstractDataSet[None, pd.DataFrame]):
>>> type: pandas.SQLQueryDataSet
>>> sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
>>> credentials: db_credentials
>>> layer: raw

Advance example using the `stream_results` and `chunk_size` option to reduce memorgy usage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Advance example using the `stream_results` and `chunk_size` option to reduce memorgy usage
Advanced example using the `stream_results` and `chunksize` option to reduce memory usage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a tonsss for picking this up, I am really bad at this😅.

tests/extras/datasets/pandas/test_sql_dataset.py Outdated Show resolved Hide resolved
@clotildeguinard
Copy link
Contributor Author

Hi all, thanks a lot for the reviews and comments! I am really sorry but I am not able to handle comments currently. I will probably be back starting from next week. I will have a look if the PR is not merged by then.

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
@noklam
Copy link
Contributor

noklam commented Nov 15, 2022

@clotildeguinard Thank you for creating the PR! No worries I will take over the PR for the minor changes, they are mostly just comments and documentation. It's good to go. 👯

@noklam noklam merged commit 665b46d into kedro-org:main Nov 15, 2022
luizvbo pushed a commit to luizvbo/kedro that referenced this pull request Nov 21, 2022
* Add execution_options to SQLQueryDataSet

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Update RELEASE.md

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Fix lint issues

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Add a check in the unit test

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Add test for connection reuse behaviour

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Inject execution_options at load time

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* enhance doc and fix incorrect example

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* More docs change

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Fixing typos

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Co-authored-by: Nok <nok_lam_chan@mckinsey.com>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
Signed-off-by: Luiz Otavio V. B. Oliveira <luiz.vbo@gmail.com>
noklam added a commit that referenced this pull request Nov 30, 2022
* Update attrs requirement from ~=21.3 to >=22.1.0,<23.0 in /dependency

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Updated the documentation to reflect the code changes

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Add Comment

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Changed Attrs pin to attrs>=20.0, <23.0 

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Updated Release Notes

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Updated Lower Bound to be 21.3

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Updated Release Notes

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>

* Update RELEASE.md

Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Remove upper bound

* Add a video dataset (#1312)

This commit adds a video dataset that can read and write video files. The
dataset is backed by OpenCVs video reader and writer and different
buffer protocols such as PIL.Image and numpy.ndarray. There is one class
for iterables (e.g. list of frames) and one for generators. Since large
videos might not fit in memory, this ability allows us to read one frame
from the video file, transform the frame and write it to another video
file. Thus there is only need for one frame at a time in memory.

The different codecs that are supported depends on the codecs installed
and how OpenCV was compiled.

Signed-off-by: Daniel Falk <daniel.falk.1@fixedit.ai>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Update fsspec requirement in /dependency (#2024)

Updates the requirements on [fsspec](https://github.com/fsspec/filesystem_spec) to permit the latest version.
- [Release notes](https://github.com/fsspec/filesystem_spec/releases)
- [Commits](fsspec/filesystem_spec@2021.04.0...2022.11.0)

---
updated-dependencies:
- dependency-name: fsspec
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Update pip-tools requirement from ~=6.9 to ~=6.10 in /dependency (#2023)

Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.9.0...6.10.0)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Add execution_options to SQLQueryDataSet (#1865)

* Add execution_options to SQLQueryDataSet

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Update RELEASE.md

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Fix lint issues

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Add a check in the unit test

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Add test for connection reuse behaviour

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* Inject execution_options at load time

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

* enhance doc and fix incorrect example

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* More docs change

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Fixing typos

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Co-authored-by: Nok <nok_lam_chan@mckinsey.com>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Update attrs requirement from ~=21.3 to >=22.1.0,<23.0 in /dependency

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Updated the documentation to reflect the code changes

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Add Comment

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Changed Attrs pin to attrs>=20.0, <23.0

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Updated Release Notes

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Updated Lower Bound to be 21.3

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Updated Release Notes

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Update links to Slack & add note about security (#2033)

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Apply OpenSSF Best Practices Badge (#2034)

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Add SVMLight DataSet (#1992)

* Add SVMLight DataSet

Add DataSet for svmlight/libsvm files using scikit-learn library as backend.

Resolves #1972

Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com>

* Pin scikit-learn version to work with python 3.7

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Update dataset docstring

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

* Pin requirements to work with python 3.7

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Add requirements to setup.py

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Add changes to dataset docstring

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Dummy commit to retrigger CI pt1

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Dummy commit to retrigger CI pt2

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>

* Update dataset dockstring

Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com>

Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Co-authored-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Update RELEASE.md

Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Improve get started docs and guide to working with notebooks (#2031)

* Revised the Introduction to make it short and sweet.
* Revised the Get Started section. Gone is "Hello Kedro". Gone are the installation pre-requisites (that's just part of the Install Kedro page now). Gone is the "Standalone use of the data catalog - woot woot" and GONE is the page on Kedro starters.
* Reordered the create project material to put the project structure breakdown in the section that introduces key concepts and shorten the Iris tutorial to the bare minimum. I did add visualisation at this point though, to highlight Kedro Viz, as I felt it was coming far too late in the spaceflights tutorial and needed to be more prominent as a feature.
* Added a TL;DR page to Get Started which some people could probably just use as-is and ignore the rest of the section.
* Starters material has moved into a new section all about "Kedro project setup". Much of that section still needs review/revision but I have updated the Starters page so it reads more clearly.
* Improved the Kedro-Viz page somewhat (still more to come for Plotly)
* Notebooks/IPython materials now merged and simplified

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

* Remove upper bound

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com>
Signed-off-by: Daniel Falk <daniel.falk.1@fixedit.ai>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>
Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com>
Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
Co-authored-by: Nok Lam Chan <nok_lam_chan@mckinsey.com>
Co-authored-by: Daniel Falk <daniel@da-robotteknik.se>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: clotildeguinard <clotilde_guinard@hotmail.fr>
Co-authored-by: Yetunde Dada <43755008+yetudada@users.noreply.github.com>
Co-authored-by: Cyril Korotkov <korotkovkm@gmail.com>
Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com>
Co-authored-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding execution_options to kedro.extras.datasets.pandas.SQLQueryDataSet
3 participants