Snowpark (Snowflake) dataset for kedro #104
Conversation
Signed-off-by: Vladimir Filimonov <vladimir_filimonov@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Snowflake/snowpark dataset implementation
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@deepyaman I addressed the comments and pushed a couple of commits, but I see the lint check failing now. I don't think it's related to the snowpark changes, can you confirm? Thanks!
@heber-urdaneta Does the lint still fail if you pull the latest changes from main?
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@AhdraMeraliQB thanks! Most errors were fixed, but I had to push an additional commit to fix the video_dataset, hope that's fine! All checks pass now.
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Hi @Vladimir-Filimonov, thank you again for your patience on this. We've got together as a team and have reached a consensus on the right way forward. We make an effort not to extend the underlying API of a dataset, and this is why we're a little uncomfortable supporting it. Asks for you:
Roadmap for ourselves (or any community contributors who would like to get involved):
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Remove pd interactions and add docs
Thank you, this is really, really looking good.
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@merelcht I think this is ready for final review :)
Fantastic work @Vladimir-Filimonov! Thanks for the contribution 🎉
Kudos to @heber-urdaneta!
Thank you so much for this contribution!! ⭐
I left some minor comments around the wording of the docs. Also don't forget to update the release notes with this addition. See the previous notes for how we format dataset additions: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/RELEASE.md
Suggested change to the docstring:

```diff
-One can skip everything but "table_name" if database and
-schema provided via credentials. Therefore catalog entries can be shorter
-if ex. all used Snowflake tables live in same database/schema.
-Values in dataset definition take priority over ones defined in credentials
+You can skip everything but "table_name" if the database and
+schema are provided via credentials. That way catalog entries can be shorter
+if, for example, all used Snowflake tables live in the same database/schema.
+Values in the dataset definition take priority over those defined in credentials.
```
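To make the docstring's point concrete, here is a minimal catalog entry sketch. The dataset name "weather", the table name, and the "snowflake_client" credentials key are all hypothetical; only `table_name` is set explicitly because the database and schema come from the credentials entry.

```yaml
# Hypothetical catalog.yml entry. "weather", "weather_data", and
# "snowflake_client" are illustrative names; database and schema are
# inherited from the shared credentials entry, so only table_name is needed.
weather:
  type: snowflake.SnowparkTableDataSet
  table_name: weather_data
  credentials: snowflake_client
```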
Suggested change to the docstring:

```diff
-Credentials file provides all connection attributes, catalog entry
-"weather" reuse credentials parameters, "polygons" catalog entry reuse
-all credentials parameters except providing different schema name.
-Second example of credentials file uses externalbrowser authentication
+Credentials file provides all connection attributes, catalog entry
+"weather" reuses credentials parameters, "polygons" catalog entry reuses
+all credentials parameters except providing a different schema name.
+Second example of credentials file uses ``externalbrowser`` authentication
```
user: "john_doe@wdomain.com"
authenticator: "externalbrowser"
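For context, a full credentials entry of this second kind might look like the following sketch. Only the `authenticator` value comes from the snippet above; the entry name, account, database, and schema values are made up for illustration.

```yaml
# Hypothetical credentials.yml entry using externalbrowser authentication.
# All values except "authenticator" are illustrative placeholders.
snowflake_client:
  account: "ab12345.eu-central-1"
  user: "john_doe@wdomain.com"
  authenticator: "externalbrowser"
  database: "meteorology"
  schema: "observations"
```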
As of Jan 2023, the Snowpark connector only works with Python 3.8.
I think it's worth putting this all the way at the top of the class docstring. I can imagine a lot of users would just skip reading the examples.
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Thanks again for this contribution! 🎉 I'll get it merged in.
@Vladimir-Filimonov I don't seem to be allowed to push changes to your branch. Could you please resolve the merge conflicts for the release notes? Then we can merge it in.
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Update branch
@merelcht thanks for the note, the conflict is resolved and the branch should be ready to merge!
* Add Snowpark datasets Signed-off-by: Vladimir Filimonov <vladimir_filimonov@mckinsey.com> Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com> Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Description
Ready-for-review PR summarising the work and discussions that happened as part of #78 (we needed a clean start so that all git commits are signed properly).
This PR:
This allows Kedro users to work with Snowflake data using Snowpark DataFrames, which attempt to mimic the PySpark DataFrame interface.
Development notes
The Snowpark package from Snowflake works only with Python 3.8. It also requires a higher version of pyarrow, so we had to bump the version in requirements.
How to run tests
To run the tests you need a Snowflake instance to run them against.
Under kedro-datasets/tests/snowflake you can find a README explaining how to run the tests locally, along with guidance on what permissions the Snowflake user needs for the tests to execute successfully. Snowpark-related tests are disabled by default in the pytest scope. The Snowpark dataset class is also excluded from the test coverage report (since its tests don't run by default and would otherwise lower the overall coverage).
Checklist

- Updated the RELEASE.md file