Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved SQL functionality #880

Closed
BenjaminLevyQB opened this issue Sep 8, 2021 · 8 comments
Closed

Improved SQL functionality #880

BenjaminLevyQB opened this issue Sep 8, 2021 · 8 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature

Comments

@BenjaminLevyQB
Copy link
Contributor

Linked to PR #879

Description

Kedro's SQL functionality is still missing some key features, some of which this PR seeks to add. Specifically, two main features are added:

  1. The existing pandas.SQLQueryDataSet is modified to allow for a long SQL query to be stored in a file and referenced through the filepath argument
  2. A new dataset, sql.SQLConnectionDataSet is added to give the user access to a sqlalchemy Connection object

Context

Being able to run complex queries on SQL databases is essential for many data science projects. However, doing this in a kedronic way, where all the I/O logic is offloaded to the catalog, is difficult when the queries are complex or it is preferable to use something other than pandas (extremely large datasets shouldn't be loaded into memory, for instance).

@BenjaminLevyQB BenjaminLevyQB added the Issue: Feature Request New feature or improvement to existing feature label Sep 8, 2021
@antonymilne
Copy link
Contributor

A couple of general comments here - I'll save specific ones for the PR.

  1. The existing pandas.SQLQueryDataSet is modified to allow for a long SQL query to be stored in a file and referenced through the filepath argument

This is a great idea. It's been talked about before (@tomvigrass) but never properly implemented on kedro. The one question it maybe raises is whether it's better to have a filepath argument for pandas.SQLQueryDataSet or to create a whole new dataset for this. Personally I like your approach. It does change the API slightly compared to pandas.read_sql_query, which is something we try to avoid wherever possible, but I think it's fine here.

  1. A new dataset, sql.SQLConnectionDataSet is added to give the user access to a sqlalchemy Connection object

This sounds good but I'm afraid I don't understand sqlalchemy well enough to really understand its significance. Please could you explain a bit why this is useful?

Also 👍 for "kedronic" 😀

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Sep 9, 2021

Hello all,

sorry to butt in but I have a lot to say on this issue. First, a small disclaimer: I am not part of the kedro team and my vision on this may not be shared by the core team. However, i think it is valuable so I'll give it anyway :)

On the one hand, I share all the points discussed here:

  • you do not want to load everything in memory
  • you want to make the computation part of the nodes instead of the catalog
  • you do not want to recreate the connection for each dataset instance...)
  • this is not kedronic (👍 for the name too ;)) because you do not have a well identified place to create this connection

And a lot more I will discussed later :)

On the other hand, I don't think the solution you propose in your point 1. is the right way to handle this. I really like the idea of creating a sql.SQLConnectionDataSet you can propagate through nodes though (solution 2.), but it is a very limited way in terms of functionnality (you are stuck to sql alchemy and it only focuses on sql query running which is a very specific use case, you still recreate the connection on load which is "network heavy", you can use it only inside nodes and not inside the catalog itself for other datasets...) and it does not sound kedronic either (it is declared in the catalog while it is not related to data loading/saving!).

I won't elaborate more in this issue because I have no time right now, but for the record:

  1. I am currently writing a "Universal Kedro deployment" issue (a serie of detailed software design documentation) on this very topic. It takes times, because making things both detailed enough to implement it, general enough to solve a lot of common use case and understandable to the greater number is not an easy task.
  2. I will publish a kedro-engines plugin very soon (likely within 1 month, maybe 2) to handle this use case and many more. The development has already begun.

I'll give some follow here once I am ready to share what I came up with for solving this.

@BenjaminLevyQB
Copy link
Contributor Author

@AntonyMilneQB I had the exact same discussion about whether to extend SQLQueryDataSet or write a new class entirely with @datajoely. I started with the new class approach and quickly realized that 90% of the code was identical. I also think that it makes more sense to have the same class from a user perspective. Imagine how someone might use this dataset: (1) start off with a simple query (SELECT col1 from table_a) and then (2) the query gets longer and longer until finally it's stored in a separate file. This would be easier if they simply need to change the sql argument to a filepath argument, rather than having to look up and find the name of a new dataset.

As for the use of the SQLConnectionDataSet, this is something I'm still working on an example use case for (see the docstring) so I hope to have a more concrete answer (@datajoely if you have a more specific idea...) but in a nutshell, the idea is that it might be good to allow for a user to perform all data manipulations for a given node entirely on a SQL database and never load anything into memory. This is of course possible if they were to create the sqlalchemy object themselves, but creating this dataset allows for the use of Kedro credentials and less hardcoding, as one possible advantage.

@Galileo-Galilei thank you for your comments. I actually agree that the solution in point 1 (the SQLQueryDataSet extension) does not achieve the goals of avoiding loading everything in memory. That goal is entirely for point 2 (the SQLConnectionDataSet). What is achieved by the SQLQueryDataSet is simply allowing for complex SQL queries to be written in an external .sql file outside of the catalog (taking advantage of syntax highlighting and other IDE tools, which seems to be a small thing but honestly can be quite impactful for the developer experience; as well as making the catalog less cluttered). This dataset is indeed intended to load the result into memory 😄 . So in summary, this PR has two datasets, which solve two distinct but related issues having to do with SQL functionality in Kedro.

@BenjaminLevyQB
Copy link
Contributor Author

Update, this PR has been split into #886 and #887

@antonymilne
Copy link
Contributor

Thanks for the explanation @BenjaminLevyQB and also the thoughtful comments as ever @Galileo-Galilei.

Completely agree with you on the SQLQueryDataSet and why it's cleaner to extend SQLQueryDataSet to accept filepath rather than make a new class for it.

@Galileo-Galilei Do you think there is an actual problem with the point 1 (extending SQLQueryDataSet to accept filepath)? I think everyone agrees it's not the complete solution to how kedro handles SQL, but it seems like a very valuable (and easy to make) additional capability for a dataset that already exists and is used, even if the whole SQL dataset approach isn't the right solution ultimately.

sql.SQLConnectionDataSet will take me longer to think about given I don't do much stuff in SQL any more... But it also strikes me as a good step forward, even if it's not the full solution. @Galileo-Galilei Do you think adding the dataset is actually a bad move as an incremental step towards a happier kedro-SQL world or again just that it's not the full solution? Fully understand that long-term handling SQL through datasets in the data catalog might not be right, but again this feature seems to me like a very valuable improvement to current capabilities even if it's not optimal.

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Sep 11, 2021

Do you think there is an actual problem with the point 1 (extending SQLQueryDataSet to accept filepath)

Not at all. What I mean is that patching the SQLQueryDataSet is not a sustainable long term solution. In my opinion, the right solution is to remove from the catalog all the datasets which perform computation on different backend (the XXXQueryDataSet). Obviously, this assume we have a better way to perform such operations and it is not the case now. This PR solve a symptom and this is very useful for now but not the real issue ( in french I would say that we"put a bandage on a wooden leg" - not sure how to translate it, but it is quite explicit 😄 ).

Do you think adding the [sql.SQLConnectionDataSet] is actually a bad move as an incremental step towards a happier kedro-SQL world or again just that it's not the full solution?

I think it is a very good move towards the right solution, and what I will suggest will be very similar. My main concern here is that I think the DataCatalog is not the right place to declare such a connection (I mean sql.SQLConnectionDataSet should not be a DataSet but another object more suited to how we want to use it). However, I acknowledge that it is currently the best place because you can leverage Kedro's credentials mechanism so it is certainly a first step towards a "happier kedro-SQL world".

To summarize if those 2 datasets were to be released tomorrow, I would probably make an extensive use of these :). Since refactoring the catalog is something that will likely take months (years?), it makes sense to provide them as a short term solution to some problems users are facing with Kedro/SQL interaction right now.

@stale
Copy link

stale bot commented Nov 10, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 10, 2021
@stale stale bot closed this as completed Nov 17, 2021
@datajoely datajoely reopened this Nov 17, 2021
@stale stale bot removed the stale label Nov 17, 2021
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@merelcht
Copy link
Member

I'm closing this issue, because both PRs related to it have been addressed. Feel free to open another issue if any more concrete things can be implemented/changed or start a github discussion to discuss this further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

5 participants