Improved SQL functionality #880
Comments
A couple of general comments here - I'll save specific ones for the PR.
This is a great idea. It's been talked about before (@tomvigrass) but never properly implemented in Kedro. The one question it maybe raises is whether it's better to have a
This sounds good, but I'm afraid I don't understand. Also 👍 for "kedronic" 😀
Hello all, sorry to butt in, but I have a lot to say on this issue. First, a small disclaimer: I am not part of the Kedro team, and my vision on this may not be shared by the core team. However, I think it is valuable, so I'll give it anyway :) On the one hand, I share all the points discussed here:
And a lot more I will discuss later :) On the other hand, I don't think the solution you propose in your point 1. is the right way to handle this. I really like the idea of creating a I won't elaborate further in this issue because I have no time right now, but for the record:
I'll give some follow-up here once I am ready to share what I came up with for solving this.
@AntonyMilneQB I had the exact same discussion about whether to extend As for the use of the SQLConnectionDataSet, this is something I'm still working on an example use case for (see the docstring), so I hope to have a more concrete answer (@datajoely if you have a more specific idea...). In a nutshell, the idea is that it might be good to allow a user to perform all data manipulations for a given node entirely on a SQL database and never load anything into memory. This is of course possible if they were to create the sqlalchemy object themselves, but creating this dataset allows for the use of Kedro credentials and less hardcoding, as one possible advantage. @Galileo-Galilei thank you for your comments. I actually agree that the solution in point 1 (the
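To make the "never load anything into memory" idea concrete, here is a minimal sketch of the pattern being described: a node receives a live database connection and pushes all computation down to the SQL engine, so only the small result set ever reaches Python. This is an illustration only, using the stdlib `sqlite3` module as a stand-in for Kedro's catalog machinery and a real sqlalchemy `Connection`; the names and table are hypothetical.

```python
import sqlite3


def summarise_orders(conn: sqlite3.Connection) -> list:
    # All aggregation happens inside the database engine; only the
    # (small) aggregated result set is ever materialised in Python.
    cur = conn.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


# Demo with an in-memory database standing in for a real SQL backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)
print(summarise_orders(conn))  # [('alice', 12.5), ('bob', 5.0)]
conn.close()
```

The advantage claimed in the comment above is that a connection-style dataset lets the catalog (and Kedro credentials) own how `conn` is built, instead of each node hardcoding connection strings.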
Thanks for the explanation @BenjaminLevyQB, and also the thoughtful comments as ever @Galileo-Galilei. Completely agree with you on the @Galileo-Galilei Do you think there is an actual problem with point 1 (extending
Not at all. What I mean is that patching the SQLQueryDataSet is not a sustainable long-term solution. In my opinion, the right solution is to remove from the catalog all the datasets which perform computation on a different backend (the
I think it is a very good move towards the right solution, and what I will suggest will be very similar. My main concern here is that I think the To summarize, if those two datasets were to be released tomorrow, I would probably make extensive use of them :). Since refactoring the catalog is something that will likely take months (years?), it makes sense to provide them as a short-term solution to some of the problems users are facing with Kedro/SQL interaction right now.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm closing this issue because both PRs related to it have been addressed. Feel free to open another issue if there are more concrete things that can be implemented or changed, or start a GitHub discussion to discuss this further.
Linked to PR #879
Description
Kedro's SQL functionality is still missing some key features, some of which this PR seeks to add. Specifically, two main features are added:
- `pandas.SQLQueryDataSet` is modified to allow a long SQL query to be stored in a file and referenced through the `filepath` argument
- `sql.SQLConnectionDataSet` is added to give the user access to a `sqlalchemy` `Connection` object

Context
Being able to run complex queries on SQL databases is essential for many data science projects. However, doing this in a kedronic way, where all the I/O logic is offloaded to the catalog, is difficult when the queries are complex or it is preferable to use something other than pandas (extremely large datasets shouldn't be loaded into memory, for instance).
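The first feature above — keeping a long query in a file rather than inline in the catalog — can be sketched roughly as follows. This is a hypothetical, simplified stand-in (class name `FileBackedQueryDataSet` is invented, and stdlib `sqlite3` replaces the real pandas/sqlalchemy machinery) meant only to show the load-query-from-`filepath` mechanic.

```python
import sqlite3
import tempfile
from pathlib import Path


class FileBackedQueryDataSet:
    """Hypothetical sketch: read a SQL query from a file and run it on load()."""

    def __init__(self, filepath: str, connection: sqlite3.Connection):
        self._filepath = Path(filepath)
        self._connection = connection

    def load(self) -> list:
        sql = self._filepath.read_text()  # the long query lives in a file,
        return self._connection.execute(sql).fetchall()  # not in the catalog


# Demo: store a query in a file and execute it through the dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

with tempfile.TemporaryDirectory() as tmp:
    query_file = Path(tmp) / "query.sql"
    query_file.write_text("SELECT SUM(x) FROM t")
    ds = FileBackedQueryDataSet(str(query_file), conn)
    result = ds.load()

print(result)  # [(6,)]
conn.close()
```

In the real proposal, the `filepath` would be declared in the catalog YAML entry and credentials would come from Kedro's credentials handling rather than being constructed in code.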