Add execution_options to SQLQueryDataSet #1865

clotildeguinard · 2022-09-22T10:35:18Z

Signed-off-by: Clotilde Guinard clotilde_guinard@hotmail.fr

Description

Resolves #752

Development notes

Now a SQLQueryDataSet can be customized with execution options (see https://docs.sqlalchemy.org/en/latest/core/connections.html#sqlalchemy.engine.Connection.execution_options).

Internally, the SQLQueryDataSet class used to "cache" the different engines in order to reuse the same object when 2 datasets where sharing the same connection string. So we had to extend the behaviour of this cache in order to take the execution options into account in the cache key.

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Resolves #752

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

clotildeguinard · 2022-09-22T11:04:37Z

Sorry I realize a bit late that I may simplify my implementation by injecting the execution_options at load time. This would avoid to complexify the caching behaviour... I will submit an additional commit on top of those, and un-draft the PR after that.

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

clotildeguinard · 2022-09-22T12:07:48Z

Ok, I have pushed the new implementation with updated tests.

Question for later: should I apply the same approach to the SQLTableDataSet? (which is quite close to the SQLQueryDataSet)

noklam · 2022-09-26T10:56:31Z

@MerelTheisenQB

I am slightly concerned about this change. SQLQueryDataSet is essentially a wrapper on pandas API, with this change we are adding new behaviors on top of pandas. It's a question why are we doing this and it starts to blur the API, is this still a pandas/SQLQueryDataSet then? Do we have a generic to handle connection instead of injecting parameters into each dataset? The original article about server-side cursor is certainly interesting and it's great to learn about it.

Note that at the moment pandas doesn't really support this natively. See this recently closed PR

clotildeguinard · 2022-09-26T13:49:08Z

@noklam I will let you decide with @MerelTheisenQB about the relevance of the issue itself.

Just to answer about the following question that you asked: "Do we have a generic to handle connection instead of injecting parameters into each dataset?". Actually, this is what I did in my first implementation, which injected the execution options at engine level.

Then I added the following commit (the last one of the branch at the moment) cfa45af in order to modify this and inject params into each dataset, because I thought it was adding less complexity on the "caching" mechanism (hint: 16 line changes instead of 53). But we can easily go back to the previous implementation, this is just 1 commit to revert.

merelcht · 2022-09-30T09:57:34Z

I'm not too concerned about this change adding extra functionality which doesn't exist in the pandas API. The way I see it is that we leverage existing APIs where we can, but also add in features if they benefit users. If a user wants to stick closer to the original pandas API they can always decide not to use this functionality.

I see @deepyaman was tagged in the original issue, do you have any thoughts on this?

noklam

Thank you for your contribution, this is something I don't know and happy to learn more about this batching option :) I drop some comments on the documentation, I am happy with the implementation otherwise.

noklam · 2022-11-10T13:22:15Z

kedro/extras/datasets/pandas/sql_dataset.py

+            execution_options: A dictionary with non-SQL options for the connection to
+                be applied to the underlying engine. To find all supported execution
+                options, see here:
+                https://docs.sqlalchemy.org/en/12/core/connections.html#sqlalchemy.engine.Connection.execution_options


I think this deserves more documentation as this is an advance feature and not documented in the pandas API.

i.e. What options are common and when do we use it?

The article discusses client-side & server-side computation, are there any tradeoffs here, or server-side is always preferred? I think we should at least mention that if users are running into memory issues, it is worths trying this option.
https://pythonspeed.com/articles/pandas-sql-chunking/

merelcht

I left some small comments, but I think when those are resolved this is good to be merged 🙂

merelcht · 2022-11-10T11:27:16Z

kedro/extras/datasets/pandas/sql_dataset.py

@@ -346,6 +350,10 @@ def __init__(  # pylint: disable=too-many-arguments
                https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
                All defaults are preserved, except `mode`, which is set to `r` when loading.
            filepath: A path to a file with a sql query statement.
+            execution_options: A dictionary with non-SQL options for the connection to


I think it's worth mentioning that this functionality isn't supported by pandas so that users are aware in case they only want to use pandas features.

merelcht · 2022-11-10T11:29:42Z

static/jsonschema/kedro-catalog-0.18.json

+              },
+              "execution_options": {
+                "type": "object",
+                "description": "A dictionary with non-SQL options for the connection\nto be applied to the underlying engine.\nTo find all supported execution options, see here:\nhttps://docs.sqlalchemy.org/en/12/core/connections.html#sqlalchemy.engine.Connection.execution_options"


I'd also add a similar comment here about this being additional functionality that doesn't currently exist in pandas.

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

noklam · 2022-11-15T14:45:48Z

added some docs and change the example, I think the correct option should be stream_results instead of streaming.

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

…ydataset

merelcht

Two small comments from me, but otherwise this looks good to be merged 👍 🙂

merelcht · 2022-11-15T16:16:57Z

kedro/extras/datasets/pandas/sql_dataset.py

@@ -286,7 +286,19 @@ class SQLQueryDataSet(AbstractDataSet[None, pd.DataFrame]):
        >>>   type: pandas.SQLQueryDataSet
        >>>   sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
        >>>   credentials: db_credentials
-        >>>   layer: raw
+
+    Advance example using the `stream_results` and `chunk_size` option to reduce memorgy usage


Suggested change

Advance example using the `stream_results` and `chunk_size` option to reduce memorgy usage

Advanced example using the `stream_results` and `chunksize` option to reduce memory usage

Thanks a tonsss for picking this up, I am really bad at this😅.

tests/extras/datasets/pandas/test_sql_dataset.py

clotildeguinard · 2022-11-15T16:50:31Z

Hi all, thanks a lot for the reviews and comments! I am really sorry but I am not able to handle comments currently. I will probably be back starting from next week. I will have a look if the PR is not merged by then.

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

noklam · 2022-11-15T17:55:54Z

@clotildeguinard Thank you for creating the PR! No worries I will take over the PR for the minor changes, they are mostly just comments and documentation. It's good to go. 👯

* Add execution_options to SQLQueryDataSet Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Update RELEASE.md Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Fix lint issues Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Add a check in the unit test Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Add test for connection reuse behaviour Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Inject execution_options at load time Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * enhance doc and fix incorrect example Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * More docs change Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Fixing typos Signed-off-by: Nok <nok_lam_chan@mckinsey.com> Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> Co-authored-by: Nok <nok_lam_chan@mckinsey.com> Co-authored-by: Nok Lam Chan <mediumnok@gmail.com> Signed-off-by: Luiz Otavio V. B. Oliveira <luiz.vbo@gmail.com>

* Update attrs requirement from ~=21.3 to >=22.1.0,<23.0 in /dependency Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Updated the documentation to reflect the code changes Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Add Comment Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Changed Attrs pin to attrs>=20.0, <23.0 Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Updated Release Notes Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Updated Lower Bound to be 21.3 Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Updated Release Notes Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> * Update RELEASE.md Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Remove upper bound * Add a video dataset (#1312) This commit adds a video dataset that can read and write video files. The dataset is backed by OpenCVs video reader and writer and different buffer protocols such as PIL.Image and numpy.ndarray. There is one class for iterables (e.g. list of frames) and one for generators. Since large videos might not fit in memory, this ability allows us to read one frame from the video file, transform the frame and write it to another video file. Thus there is only need for one frame at a time in memory. The different codecs that are supported depends on the codecs installed and how OpenCV was compiled. Signed-off-by: Daniel Falk <daniel.falk.1@fixedit.ai> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Update fsspec requirement in /dependency (#2024) Updates the requirements on [fsspec](https://github.com/fsspec/filesystem_spec) to permit the latest version. - [Release notes](https://github.com/fsspec/filesystem_spec/releases) - [Commits](fsspec/filesystem_spec@2021.04.0...2022.11.0) --- updated-dependencies: - dependency-name: fsspec dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Update pip-tools requirement from ~=6.9 to ~=6.10 in /dependency (#2023) Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version. - [Release notes](https://github.com/jazzband/pip-tools/releases) - [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md) - [Commits](jazzband/pip-tools@6.9.0...6.10.0) --- updated-dependencies: - dependency-name: pip-tools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Add execution_options to SQLQueryDataSet (#1865) * Add execution_options to SQLQueryDataSet Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Update RELEASE.md Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Fix lint issues Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Add a check in the unit test Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Add test for connection reuse behaviour Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * Inject execution_options at load time Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> * enhance doc and fix incorrect example Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * More docs change Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Fixing typos Signed-off-by: Nok <nok_lam_chan@mckinsey.com> Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> Co-authored-by: Nok <nok_lam_chan@mckinsey.com> Co-authored-by: Nok Lam Chan <mediumnok@gmail.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Update attrs requirement from ~=21.3 to >=22.1.0,<23.0 in /dependency Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Updated the documentation to reflect the code changes Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Add Comment Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Changed Attrs pin to attrs>=20.0, <23.0 Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Updated Release Notes Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Updated Lower Bound to be 21.3 Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Updated Release Notes Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Update links to Slack & add note about security (#2033) Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Apply OpenSSF Best Practices Badge (#2034) Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Add SVMLight DataSet (#1992) * Add SVMLight DataSet Add DataSet for svmlight/libsvm files using scikit-learn library as backend. Resolves #1972 Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com> * Pin scikit-learn version to work with python 3.7 Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update dataset docstring Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> * Pin requirements to work with python 3.7 Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add requirements to setup.py Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add changes to dataset docstring Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Dummy commit to retrigger CI pt1 Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Dummy commit to retrigger CI pt2 Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update dataset dockstring Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com> Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com> Co-authored-by: Ahdra Merali <ahdra.merali@quantumblack.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Update RELEASE.md Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Improve get started docs and guide to working with notebooks (#2031) * Revised the Introduction to make it short and sweet. * Revised the Get Started section. Gone is "Hello Kedro". Gone are the installation pre-requisites (that's just part of the Install Kedro page now). Gone is the "Standalone use of the data catalog - woot woot" and GONE is the page on Kedro starters. * Reordered the create project material to put the project structure breakdown in the section that introduces key concepts and shorten the Iris tutorial to the bare minimum. I did add visualisation at this point though, to highlight Kedro Viz, as I felt it was coming far too late in the spaceflights tutorial and needed to be more prominent as a feature. * Added a TL;DR page to Get Started which some people could probably just use as-is and ignore the rest of the section. * Starters material has moved into a new section all about "Kedro project setup". Much of that section still needs review/revision but I have updated the Starters page so it reads more clearly. * Improved the Kedro-Viz page somewhat (still more to come for Plotly) * Notebooks/IPython materials now merged and simplified Signed-off-by: Nok <nok_lam_chan@mckinsey.com> * Remove upper bound Signed-off-by: Nok <nok_lam_chan@mckinsey.com> Signed-off-by: rxm7706 <95496360+rxm7706@users.noreply.github.com> Signed-off-by: Daniel Falk <daniel.falk.1@fixedit.ai> Signed-off-by: Nok <nok_lam_chan@mckinsey.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr> Signed-off-by: Kirill Korotkov <korotkovkm@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Co-authored-by: Nok Lam Chan <mediumnok@gmail.com> Co-authored-by: Nok Lam Chan <nok_lam_chan@mckinsey.com> Co-authored-by: Daniel Falk <daniel@da-robotteknik.se> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: clotildeguinard <clotilde_guinard@hotmail.fr> Co-authored-by: Yetunde Dada <43755008+yetudada@users.noreply.github.com> Co-authored-by: Cyril Korotkov <korotkovkm@gmail.com> Co-authored-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com> Co-authored-by: Ahdra Merali <ahdra.merali@quantumblack.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

clotildeguinard added 4 commits September 22, 2022 11:55

Add execution_options to SQLQueryDataSet

81726b0

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

Update RELEASE.md

521eb81

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

Fix lint issues

69ce2f3

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

Add a check in the unit test

e860034

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

clotildeguinard requested a review from idanov as a code owner September 22, 2022 10:35

Add test for connection reuse behaviour

b9e5309

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

clotildeguinard marked this pull request as draft September 22, 2022 11:03

Inject execution_options at load time

cfa45af

Signed-off-by: Clotilde Guinard <clotilde_guinard@hotmail.fr>

clotildeguinard marked this pull request as ready for review September 22, 2022 12:05

merelcht mentioned this pull request Nov 7, 2022

Release Kedro 0.18.4 #2004

Closed

10 tasks

noklam approved these changes Nov 10, 2022

View reviewed changes

merelcht reviewed Nov 10, 2022

View reviewed changes

noklam self-assigned this Nov 15, 2022

enhance doc and fix incorrect example

a5fec04

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

noklam and others added 2 commits November 15, 2022 14:47

More docs change

99b2856

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

Merge branch 'main' into feat/add-support-for-exec-options-in-sqlquer…

6f55154

…ydataset

noklam requested a review from merelcht November 15, 2022 15:34

merelcht approved these changes Nov 15, 2022

View reviewed changes

Fixing typos

ac156b7

Signed-off-by: Nok <nok_lam_chan@mckinsey.com>

noklam merged commit 665b46d into kedro-org:main Nov 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add execution_options to SQLQueryDataSet #1865

Add execution_options to SQLQueryDataSet #1865

clotildeguinard commented Sep 22, 2022 •

edited by gitpod-io bot

Loading

clotildeguinard commented Sep 22, 2022

clotildeguinard commented Sep 22, 2022

noklam commented Sep 26, 2022 •

edited

Loading

clotildeguinard commented Sep 26, 2022

merelcht commented Sep 30, 2022

noklam left a comment

noklam Nov 10, 2022

merelcht left a comment

merelcht Nov 10, 2022

merelcht Nov 10, 2022

noklam commented Nov 15, 2022

merelcht left a comment

merelcht Nov 15, 2022

noklam Nov 15, 2022

clotildeguinard commented Nov 15, 2022

noklam commented Nov 15, 2022

	Advance example using the `stream_results` and `chunk_size` option to reduce memorgy usage
	Advanced example using the `stream_results` and `chunksize` option to reduce memory usage

Add execution_options to SQLQueryDataSet #1865

Add execution_options to SQLQueryDataSet #1865

Conversation

clotildeguinard commented Sep 22, 2022 • edited by gitpod-io bot Loading

Description

Development notes

Checklist

clotildeguinard commented Sep 22, 2022

clotildeguinard commented Sep 22, 2022

noklam commented Sep 26, 2022 • edited Loading

clotildeguinard commented Sep 26, 2022

merelcht commented Sep 30, 2022

noklam left a comment

Choose a reason for hiding this comment

noklam Nov 10, 2022

Choose a reason for hiding this comment

merelcht left a comment

Choose a reason for hiding this comment

merelcht Nov 10, 2022

Choose a reason for hiding this comment

merelcht Nov 10, 2022

Choose a reason for hiding this comment

noklam commented Nov 15, 2022

merelcht left a comment

Choose a reason for hiding this comment

merelcht Nov 15, 2022

Choose a reason for hiding this comment

noklam Nov 15, 2022

Choose a reason for hiding this comment

clotildeguinard commented Nov 15, 2022

noklam commented Nov 15, 2022

clotildeguinard commented Sep 22, 2022 •

edited by gitpod-io bot

Loading

noklam commented Sep 26, 2022 •

edited

Loading