Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to load SQL queries from a file for SQLQueryDataSet #887

Merged

Conversation

BenjaminLevyQB
Copy link
Contributor

Description

Adds a new filepath argument to the constructor of SQLQueryDataSet that allows the user to specify a file containing a SQL query instead of having to write the entire query in the catalog, allowing users to keep the catalog clean and take advantage of SQL syntax highlighting in the case of long queries

Development notes

  • Added filepath and fs_args to the constructor to allow fsspec-based loading of SQL query files
  • sql parameter is now optional, but an error is raised if neither sql nor filepath are supplied
  • Added test test_load_query_file and modified test_empty_query_error such that it raises the updated error

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@idanov idanov changed the title creating sql_dataset branch Add an option to load SQL queries from a file for SQLQueryDataSet Sep 9, 2021
@BenjaminLevyQB BenjaminLevyQB marked this pull request as ready for review September 9, 2021 19:52
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @BenjaminLevyQB ! Don't forget to update the RELEASE.md to include the changes as well as your name on the contributor list.

kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved
kedro/extras/datasets/pandas/sql_dataset.py Show resolved Hide resolved
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your excellent contribution! ⭐ I like this feature a lot but have two little concerns:

  1. Is this technically a breaking change? The signature for SQLQueryDataSet has changed so that anyone calling it with positional arguments before (like SQLQueryDataSet("SELECT * FROM table", credentials) presumably won't work now. The principal use will be through catalog.yml, for which this doesn't matter, but I think anyone using the Python API will suffer a breaking change. Curious if others agree here - if so then this should go to develop rather than master
  2. What happens if a user specifies both sql and filepath? Currently it looks like filepath will take precedent, but maybe this should throw an error message instead.

RELEASE.md Outdated Show resolved Hide resolved
RELEASE.md Outdated Show resolved Hide resolved
kedro/extras/datasets/pandas/sql_dataset.py Show resolved Hide resolved
kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved
@datajoely
Copy link
Contributor

datajoely commented Sep 29, 2021

What's left on this PR - I think it should be ready to go right? We should close #886 for reasons outlined here

@BenjaminLevyQB
Copy link
Contributor Author

@datajoely looks good to me -- sorry about the delay, can I get a review on this so it can be merged?

@datajoely
Copy link
Contributor

datajoely commented Oct 13, 2021

@BenjaminLevyQB I think there a few comments from @AntonyMilneQB (who is now on parental leave) to address

BenjaminLevyQB and others added 4 commits October 13, 2021 11:40
Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>
Co-authored-by: Antony Milne <49395058+AntonyMilneQB@users.noreply.github.com>
@BenjaminLevyQB
Copy link
Contributor Author

Updated in response to @AntonyMilneQB's comments:

  • If both sql and filepath are given, an error is raised (instead of defaulting silently to sql)
  • filepath is added to the end instead of in the middle of the arguments. This keeps sql being changed to a keyword argument, but now if the arguments are specified positionally (e.g., through the python API) then anyone who did not know about filepath previously will still have their code work.

@BenjaminLevyQB
Copy link
Contributor Author

Can I get a review on the updated code? Thanks!

@antonymilne
Copy link
Contributor

antonymilne commented Oct 20, 2021

Hi @BenjaminLevyQB, thanks for your updates on this and apologies for the delayed review - I just got back from leave. Moving filepath to the end of the arguments is a good idea but I'm not quite sure whether it's the right thing to do here in order to avoid a breaking change. Let me check with someone else on the team to see what they think and then we can get this merged 🙂 Have discussed this and it's fine - good solution for backwards compatibility!

Copy link
Contributor

@lorenabalan lorenabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice feature addition! Left a few suggestions, let me know what you think. 😊

tests/extras/datasets/pandas/test_sql_dataset.py Outdated Show resolved Hide resolved
kedro/extras/datasets/pandas/sql_dataset.py Show resolved Hide resolved
Here you can find all available arguments for `open`:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
All defaults are preserved, except `mode`, which is set to `r` when loading
and to `w` when saving.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and to `w` when saving.
and to `w` when saving.

This can be removed, we don't actually set mode to w ourselves anymore.

kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved

def _load(self) -> pd.DataFrame:
load_args = self._load_args.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this return a shallow copy instead of deepcopy? I think the latter is the one we want for dictionaries generally. 🤔

kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved

def _load(self) -> pd.DataFrame:
load_args = self._load_args.copy()

if "sql" not in load_args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why not if "filepath" in load_args? 😅 Or if self._filepath if we go with the previous suggestion.

kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved
kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved
@datajoely
Copy link
Contributor

@BenjaminLevyQB amazing work so far - some little bits needed before we can get this in. I want to release 0.17.6 shortly so this would be a wonderful addition.

@BenjaminLevyQB
Copy link
Contributor Author

Thank you @lorenabalan and @AntonyMilneQB for you excellent comments! The requested changes have been made!!

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for all your work on this @BenjaminLevyQB. I'll make it sure it gets merged 🙂

kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved
kedro/extras/datasets/pandas/sql_dataset.py Outdated Show resolved Hide resolved
tests/extras/datasets/pandas/test_sql_dataset.py Outdated Show resolved Hide resolved
@antonymilne antonymilne self-assigned this Oct 25, 2021
@antonymilne antonymilne merged commit 982dc70 into kedro-org:master Oct 25, 2021
lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022
…dro-org#887)

Signed-off-by: Laurens Vijnck <laurens_vijnck@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants