
Support schema load argument for SparkDataset #986

Merged
merged 155 commits into kedro-org:main from feat/extras_spark_schema_support on Apr 19, 2022

Conversation

lvijnck
Contributor

@lvijnck lvijnck commented Oct 24, 2021

Description

This PR adds the ability to specify a JSON schema file in the load arguments of the SparkDataset. This is especially relevant when loading JSON files. The schema file is passed to the schema method of the Spark reader when loading the dataset, i.e., it is specified in the catalog as follows:

dataset_with_schema:
  type: spark.SparkDataSet
  file_format: json
  filepath: /path/to/*.json
  load_args:
    schema: 
        filepath: path/to/schema.json
        credentials: creds
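
For context, a hedged sketch of how such a schema.json file can be produced with plain PySpark, assuming the schema file uses PySpark's own StructType JSON serialisation (the field name is illustrative):

import json
from pyspark.sql.types import StringType, StructField, StructType

# Serialise a schema to the JSON document the catalog entry points at
schema = StructType([StructField("name", StringType(), nullable=True)])
with open("schema.json", "w") as f:
    json.dump(schema.jsonValue(), f)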

Additionally, the schema argument can be supplied directly via the API, either by passing a StructType object or a DDL statement, i.e.,

SparkDataSet(..., load_args={"schema": struct_type_obj})
SparkDataSet(..., load_args={"schema": "DDL statement"})
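
For concreteness, a minimal sketch (plain PySpark; the field names are illustrative) of the two kinds of values that can be passed:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# A StructType object and the equivalent DDL string
struct_type_obj = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])
ddl_statement = "id INT, name STRING"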

Development notes

Changes are limited to Kedro's extras module. An initial test was added.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@antonymilne
Contributor

Hi @lvijnck, thanks very much for your contribution. I like the idea, but I'm not quite sure about its place in kedro. Generally we try to leave the dataset's underlying save/load API as unchanged as possible and just insert arguments via **load_args. I assume that is not possible here (i.e. read_obj.load doesn't accept a schema object, but instead you need to go through the reader's schema method)?
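
For reference, the underlying PySpark reader chain looks roughly like this (standard PySpark API; the schema and path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The schema is attached to the reader before load is called
df = spark.read.schema("id INT, name STRING").load("/path/to/*.json", format="json")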

If we can't preserve the load API then I believe the options would be:

  1. modify SparkDataSet to piggyback schema_json_path off load_args as you do here
  2. modify SparkDataSet to have a whole new (optional) top-level key for schema_json_path
  3. create an entirely new dataset SparkSchemaDataSet, inheriting from SparkDataSet, which does one of the above (so leaves the plain SparkDataSet unchanged)

As you suggest, in an ideal world it would also be nice to allow schema_json_path to take the full range of fsspec storage locations that filepath can. Is there also some way of specifying a schema when saving, or is this just relevant to load?

So overall, just to set expectations... I'm not sure whether this is going to be too complicated/API-changing to be added to kedro and instead should remain as a customisation that users can make on their own project (by doing option 3 above). But let me get some people who know more about spark than I do to weigh in 🙂
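
To illustrate option 3, a hedged sketch of what such a project-level customisation could look like (the class name, the schema_filepath argument and the load-only behaviour are all illustrative; it implements AbstractDataSet directly rather than relying on SparkDataSet's internals):

import json
from typing import Any, Dict

import fsspec
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType

from kedro.io import AbstractDataSet


class SparkSchemaDataSet(AbstractDataSet):
    """Loads a Spark DataFrame, applying a JSON schema fetched via fsspec."""

    def __init__(self, filepath: str, schema_filepath: str,
                 file_format: str = "json", load_args: Dict[str, Any] = None):
        self._filepath = filepath
        self._schema_filepath = schema_filepath
        self._file_format = file_format
        self._load_args = load_args or {}

    def _load(self) -> DataFrame:
        # fsspec infers the filesystem (s3://, local, ...) from the path itself
        with fsspec.open(self._schema_filepath, "r") as f:
            schema = StructType.fromJson(json.load(f))
        spark = SparkSession.builder.getOrCreate()
        return spark.read.schema(schema).load(
            self._filepath, format=self._file_format, **self._load_args
        )

    def _save(self, data: DataFrame) -> None:
        raise NotImplementedError("This sketch is load-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "schema_filepath": self._schema_filepath}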

@datajoely
Contributor

@AntonyMilneQB I think it's very valid to talk about how we adapt the API and how far we deviate - but I also think this is as legal as the #885 PR

@antonymilne
Contributor

> @AntonyMilneQB I think it's very valid to talk about how we adapt the API and how far we deviate - but I also think this is as legal as the #885 PR

Cool, if you think this is as legitimate as other deviations we have then that is good 👍

@lvijnck
Contributor Author

lvijnck commented Oct 25, 2021

@AntonyMilneQB Hi, thanks for your response! I think your point is fair. However, I feel it does not make sense to require users to create a custom class for functionality as basic as this. I'm adding this PR as I required the code for my current project. Feel free to decline the PR if it overly segregates the interface/class.

@lvijnck
Contributor Author

lvijnck commented Nov 8, 2021

Opening the PR to spark discussion. Eager to finish this if the goal is to bring it in.

@lvijnck lvijnck marked this pull request as ready for review November 8, 2021 16:41
@lvijnck lvijnck requested a review from idanov as a code owner November 8, 2021 16:41
Contributor

@lorenabalan lorenabalan left a comment

Thank you for creating this PR @lvijnck. Being able to read with a schema is a valuable addition to the dataset. I have some questions around whether it is definitely always JSON, or whether we're coding ourselves into a corner by limiting it that way.

@antonymilne
Contributor

antonymilne commented Nov 17, 2021

Let me add some thoughts following discussion with @jiriklein a couple of weeks ago. Overall his biggest concern was about how to handle the schema filesystem so that it worked on dbfs and other storage solutions. If there's some way of achieving this then I think we're good to add this.

#887 did something similar by creating a new filepath argument for a SQL query and a new fs_args argument so that filepath can use all supported storage solutions. It's obviously trickier here because SparkDataSet already has filepath, which is subject to various bits of custom logic for different filesystems.

Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense e.g. on dbfs?

If we can make this simplification then that's great because we can just reuse the same logic to process schema_path as we do for filepath. If we can't make the simplification and actually need to maintain two separate bits of logic independently for filepath and schema_path then that is trickier. Essentially you're injecting a JSONDataSet into the definition of a different dataset type in the catalog. We don't have a great mechanism for doing that, but maybe actually using JSONDataSet to do all the filepath etc. processing would help simplify things. I'm thinking something like this:

dataset_with_schema:
  type: spark.SparkDataSet
  file_format: json
  filepath: /path/to/*.json
  load_args:
    schema_args: 
      filepath: s3://path/to/schema.json
      credentials: blah

Then in the SparkDataSet you can just use JSONDataSet(**schema_args), which handles all the file system logic itself. If the schema can be more general than JSON then in theory that's also possible by passing a dataset_type here and then using load_obj, but it's not ideal.
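
A hedged sketch of that idea (JSONDataSet and StructType.fromJson are existing APIs; the path is illustrative):

from kedro.extras.datasets.json import JSONDataSet
from pyspark.sql.types import StructType

schema_args = {"filepath": "s3://path/to/schema.json"}  # plus credentials/fs_args as needed
# JSONDataSet handles the fsspec filesystem work; the loaded dict is then
# converted into a Spark StructType
schema = StructType.fromJson(JSONDataSet(**schema_args).load())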

Also @lorenabalan: Jiri said that schema isn't relevant when saving a dataset, just when loading. So arguably it's best to keep it in load_args, but equally there are no potential clashes with save schemas if it becomes too unwieldy and is moved to a top-level key.

@antonymilne
Contributor

Another thought: instead of embedding the JSONDataSet definition in the SparkDataSet definition as above (kind of in the style of PartitionedDataSet), what about just referring to another dataset in the definition?

dataset_with_schema:
  type: spark.SparkDataSet
  file_format: json
  filepath: /path/to/*.json
  load_args:
    schema_dataset: json_schema

json_schema:
  filepath: s3://path/to/schema.json
  credentials: blah

Advantages: much less nesting/simpler; fully general, as you can immediately use any sort of dataset type for the schema (if it is indeed possible to use something other than JSON). Disadvantages: having a catalog entry that depends on another like this is completely different from how we normally do things in kedro; no idea how you would actually implement it, given you've coupled two datasets together and so need the catalog available in SparkDataSet to do catalog.load(schema_dataset).

Overall I'm pretty sure this is a very bad idea, but interested in seeing what others think.

@datajoely
Contributor

> Overall I'm pretty sure this is a very bad idea, but interested in seeing what others think.

It's an interesting thought but I'm inclined to agree. We don't want to build a dependency tree!

All in all I think this should follow the same pattern as the sql file addition discussed.

@antonymilne
Contributor

> It's an interesting thought but I'm inclined to agree. We don't want to build a dependency tree!
>
> All in all I think this should follow the same pattern as the sql file addition discussed.

How would you deal with the schema file on different filesystems in this case? From above: Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense e.g. on dbfs?

@datajoely
Contributor

> How would you deal with the schema file on different filesystems in this case? From above: Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense e.g. on dbfs?

I think fsspec is more than 'good enough' and we could revisit this if people start asking for more.

@lvijnck lvijnck marked this pull request as draft December 2, 2021 19:38
@SajidAlamQB
Contributor

@lvijnck Are you still wanting to complete this PR?

@lvijnck
Contributor Author

lvijnck commented Feb 7, 2022

> @lvijnck Are you still wanting to complete this PR?

Yeah, aiming to complete it sometime this week.

@lvijnck lvijnck marked this pull request as ready for review February 27, 2022 19:52
@lvijnck
Contributor Author

lvijnck commented Feb 27, 2022

@lorenabalan Sorry for the slight delay in finalizing this; I've just tackled the items we discussed live. There are 2 minor comments still open, LMK what you think.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
Contributor

@lorenabalan lorenabalan left a comment

So sorry for the late review! It's 90% there, just some testing improvements and it'll be ready to merge. 🚀

@lorenabalan
Contributor

> @lorenabalan There are 2 minor comments still open, LMK what you think.

I found the DataSetError one, but what's the second question?

@lvijnck
Contributor Author

lvijnck commented Mar 13, 2022

> > @lorenabalan There are 2 minor comments still open, LMK what you think.
>
> I found the DataSetError one, but what's the second question?

@lorenabalan my bad, it was concerning the credentials used to read the schema. Do you think we should also add a test for that? AFAIK I'm directly handing off the credentials to the FS.
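
For illustration, a hedged sketch of what "handing off the credentials to the FS" amounts to in fsspec terms (the protocol and credential keys are illustrative; s3fs, for example, accepts key/secret):

import fsspec

credentials = {"key": "<access-key>", "secret": "<secret-key>"}  # illustrative values
# The credentials dict is passed straight through as filesystem kwargs
fs = fsspec.filesystem("s3", **credentials)
with fs.open("path/to/schema.json") as f:
    raw_schema = f.read()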

@lvijnck lvijnck requested a review from lorenabalan March 13, 2022 10:24
@merelcht merelcht requested a review from antonymilne March 21, 2022 15:43
SajidAlamQB and others added 14 commits April 7, 2022 15:02
* Bump up version to 0.17.7

* Update CITATION.cff

* changes based on review

* …o-org#1302)
@lvijnck lvijnck force-pushed the feat/extras_spark_schema_support branch from 540b419 to 6ea2302 Compare April 7, 2022 14:06
@lvijnck lvijnck requested a review from yetudada as a code owner April 7, 2022 14:06
merelcht and others added 4 commits April 7, 2022 16:17
Contributor

@antonymilne antonymilne left a comment

This is great, thanks so much for all your work getting it over the line!

@merelcht
Member

You can ignore the failing tests for test_incremental_dataset and test_partitioned_dataset, but we do need to reach 100% coverage, and currently kedro/extras/datasets/spark/spark_dataset.py line 364 isn't covered (see the report here).

@merelcht merelcht merged commit 340a212 into kedro-org:main Apr 19, 2022
@datajoely
Contributor

That was an ordeal! Well done @lvijnck!
