Support schema load argument for SparkDataset #986
Conversation
Hi @lvijnck, thanks very much for your contribution. I like the idea, but I'm not quite sure about its place in kedro. Generally we try to leave the dataset's underlying save/load API as unchanged as possible and just insert arguments via `load_args`/`save_args`. If we can't preserve the load API, then I believe there are a few options.
As you suggest, in an ideal world it would also be nice to allow this more generally. So overall, just to set expectations: I'm not sure whether this is going to be too complicated/API-changing to be added to kedro, and whether it should instead remain as a customisation that users can make on their own project (by doing option 3 above). But let me get some people who know more about Spark than I do to weigh in 🙂 |
@AntonyMilneQB I think it's very valid to talk about how we adapt the API and how far we deviate - but I also think this is as legal as the #885 PR |
Cool, if you think this is as legitimate as other deviations we have then that is good 👍 |
@AntonyMilneQB Hi, thanks for your response! I think your point is fair. However, I feel like it does not make sense to require that users have to create a custom class for functionality as basic as this. Adding this PR as I required the code for my current project. Feel free to decline the PR if this overly segregates the interface/class. |
Opening PR to spark discussion. Eager to finish this if the goal is to bring it in. |
Thank you for creating this PR @lvijnck. Being able to read with a schema is a valuable addition to the dataset. I have some questions around whether it is definitely always JSON, or whether we're coding ourselves into a corner by limiting it that way.
Let me add some thoughts following a discussion with @jiriklein a couple of weeks ago. Overall his biggest concern was about how to handle the schema's filesystem so that it works on dbfs and other storage solutions. If there's some way of achieving this then I think we're good to add this. #887 did something similar. Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense, e.g. on dbfs? If we can make this simplification then that's great, because we can just reuse the same logic to process the schema filepath.
Also @lorenabalan: Jiri raised a related point on this that's worth double-checking. |
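To make the filesystem concern concrete, here is a minimal, purely illustrative sketch of reading a schema stored as a JSON file. The helper name is hypothetical; in the real dataset the parsed dict would be handed to pyspark's `StructType.fromJson`, and the file would be opened through fsspec so dbfs/s3/local paths all work. Plain `pathlib` is used here only to keep the sketch dependency-free.

```python
import json
from pathlib import Path


def load_schema_from_file(filepath):
    """Hypothetical helper: parse a Spark schema stored as JSON.

    In the actual dataset this dict would be passed to
    pyspark.sql.types.StructType.fromJson, and the file would be
    opened via fsspec rather than pathlib.
    """
    return json.loads(Path(filepath).read_text(encoding="utf-8"))


# A schema dict in the shape produced by df.schema.json() in Spark
example = {
    "type": "struct",
    "fields": [
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
        {"name": "age", "type": "integer", "nullable": True, "metadata": {}},
    ],
}
```

Because the parsing step is just JSON, the filesystem question reduces to how the file handle is obtained, which is exactly what fsspec abstracts away.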
Another thought: instead of embedding the schema inside the SparkDataset entry, it could be defined as its own catalog entry that the dataset refers to.
Advantages: much less nesting, so simpler; fully general, as you can immediately use any sort of dataset type for the schema (if it is indeed possible to use something other than JSON). Disadvantages: having a catalog entry that depends on another like this is completely different from how we normally do things in kedro, and I have no idea how you would actually implement it given you've coupled two datasets together. Overall I'm pretty sure this is a very bad idea, but I'm interested in seeing what others think. |
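As a purely illustrative sketch of that (ultimately rejected) idea, with all entry names and keys made up, the coupling might have looked something like:

```yaml
# Hypothetical, never-implemented form: the dataset references
# another catalog entry by name for its schema.
my_schema:
  type: text.TextDataSet          # or some dedicated schema dataset
  filepath: data/schemas/cars_schema.json

cars:
  type: spark.SparkDataSet
  filepath: data/01_raw/cars.json
  file_format: json
  load_args:
    schema: my_schema             # reference to the entry above
```

The disadvantage noted above shows up immediately: resolving `my_schema` would require the catalog to load one dataset while constructing another.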
It's an interesting thought but I'm inclined to agree. We don't want to build a dependency tree! All in all I think this should follow the same pattern as the sql file addition discussed. |
How would you deal with the schema file living on a different filesystem in this case? From above: "Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense, e.g. on dbfs?" |
I think fsspec is more than 'good enough' and we could revisit this if people start asking for more. |
@lvijnck Are you still wanting to complete this PR? |
Yeah, aiming to complete it sometime this week. |
@lorenabalan Sorry for the slight delay in finalizing this, just tackled items we discussed live. There are 2 minor comments still open, LMK what you think. |
So sorry for the late review! It's 90% there, just some testing improvements and it'll be ready to merge. 🚀
I found the credentials handling a little unclear. |
@lorenabalan my bad, it was concerning the credentials to read the schema. Do you think we should also add a test for that? AFAIK I'm directly handing off the credentials to the FS. |
* Bump up version to 0.17.7
* Update CITATION.cff
* Changes based on review
This is great, thanks so much for all your work getting it over the line!
You can ignore the failing tests. |
That was an ordeal! Well done @lvijnck ! |
Description
This PR provides the ability to specify a JSON schema file in the load arguments of the SparkDataset. This is especially relevant when loading JSON files. The schema file is passed to the `schema` method of the Spark reader when loading the dataset, i.e., it is specified in the catalog. Additionally, the schema argument can be supplied directly via the API, either by passing in a `StructType` object or a DDL string.
Development notes
Changes are limited to Kedro's extras module. An initial test was added.
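For illustration, a catalog entry along the lines described above might look like the following. The filepath, bucket, and credentials names are placeholders, and the nested `schema` keys reflect the shape discussed in this PR; check the released documentation for the final form.

```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather*
  file_format: csv
  load_args:
    header: true
    schema:
      filepath: s3a://your_bucket/data/01_raw/schema.json
      credentials: dev_s3
```

Equivalently, when constructing the dataset in Python, the schema could be passed directly as a `pyspark.sql.types.StructType` object or as a DDL string such as `"name STRING, age INT"`.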
Checklist
Updated the RELEASE.md file
Notice
I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":
I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.
I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.