
Support schema load argument for SparkDataset #986

Merged
merged 155 commits into kedro-org:main from feat/extras_spark_schema_support on Apr 19, 2022

Conversation

lvijnck
Contributor

@lvijnck lvijnck commented Oct 24, 2021

Description

This PR adds the ability to specify a JSON schema file in the load arguments of the SparkDataset. This is especially relevant when loading JSON files. The schema file is passed to the schema method of the Spark reader when loading the dataset, i.e., it is specified in the catalog as follows:

dataset_with_schema:
  type: spark.SparkDataSet
  file_format: json
  filepath: /path/to/*.json
  load_args:
    schema: 
        filepath: path/to/schema.json
        credentials: creds
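
For context, a hedged sketch of how such a schema.json file can be produced with plain PySpark, assuming the schema file uses PySpark's own StructType JSON serialisation (the field name is illustrative):

import json
from pyspark.sql.types import StringType, StructField, StructType

# Serialise a schema to the JSON document the catalog entry points at
schema = StructType([StructField("name", StringType(), nullable=True)])
with open("schema.json", "w") as f:
    json.dump(schema.jsonValue(), f)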

Additionally, the schema argument can be supplied directly via the API, either by passing a StructType object or a DDL statement, i.e.,

SparkDataSet(..., load_args={"schema": struct_type_obj})
SparkDataSet(..., load_args={"schema": "DDL statement"})
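
For concreteness, a minimal sketch (plain PySpark; the field names are illustrative) of the two kinds of values that can be passed:

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# A StructType object and the equivalent DDL string
struct_type_obj = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])
ddl_statement = "id INT, name STRING"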

Development notes

Changes are limited to Kedro's extras module. An initial test was added.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@antonymilne
Contributor

Hi @lvijnck, thanks very much for your contribution. I like the idea, but I'm not quite sure about its place in kedro. Generally we try to leave the dataset's underlying save/load API as unchanged as possible and just insert arguments via **load_args. I assume that is not possible here (i.e. read_obj.load doesn't accept a schema object, but instead you need to go through the reader's schema method)?
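
For reference, the underlying PySpark reader chain looks roughly like this (standard PySpark API; the schema and path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The schema is attached to the reader before load is called
df = spark.read.schema("id INT, name STRING").load("/path/to/*.json", format="json")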

If we can't preserve the load API then I believe the options would be:

  1. modify SparkDataSet to piggyback schema_json_path off load_args as you do here
  2. modify SparkDataSet to have a whole new (optional) top-level key for schema_json_path
  3. create an entirely new dataset SparkSchemaDataSet, inheriting from SparkDataSet, which does one of the above (so leaves the plain SparkDataSet unchanged)

As you suggest, in an ideal world it would also be nice to allow schema_json_path to take the full range of fsspec storage locations that filepath can. Is there also some way of specifying a schema when saving, or is this just relevant to load?

So overall, just to set expectations... I'm not sure whether this is going to be too complicated/API-changing to be added to kedro and instead should remain as a customisation that users can make on their own project (by doing option 3 above). But let me get some people who know more about spark than I do to weigh in 🙂
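
To illustrate option 3, a hedged sketch of what such a project-level customisation could look like (the class name, the schema_filepath argument and the load-only behaviour are all illustrative; it implements AbstractDataSet directly rather than relying on SparkDataSet's internals):

import json
from typing import Any, Dict

import fsspec
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import StructType

from kedro.io import AbstractDataSet


class SparkSchemaDataSet(AbstractDataSet):
    """Loads a Spark DataFrame, applying a JSON schema fetched via fsspec."""

    def __init__(self, filepath: str, schema_filepath: str,
                 file_format: str = "json", load_args: Dict[str, Any] = None):
        self._filepath = filepath
        self._schema_filepath = schema_filepath
        self._file_format = file_format
        self._load_args = load_args or {}

    def _load(self) -> DataFrame:
        # fsspec infers the filesystem (s3://, local, ...) from the path itself
        with fsspec.open(self._schema_filepath, "r") as f:
            schema = StructType.fromJson(json.load(f))
        spark = SparkSession.builder.getOrCreate()
        return spark.read.schema(schema).load(
            self._filepath, format=self._file_format, **self._load_args
        )

    def _save(self, data: DataFrame) -> None:
        raise NotImplementedError("This sketch is load-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "schema_filepath": self._schema_filepath}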

@datajoely
Contributor

@AntonyMilneQB I think it's very valid to talk about how we adapt the API and how far we deviate - but I also think this is as legal as the #885 PR

@antonymilne
Contributor

> @AntonyMilneQB I think it's very valid to talk about how we adapt the API and how far we deviate - but I also think this is as legal as the #885 PR

Cool, if you think this is as legitimate as other deviations we have then that is good 👍

@lvijnck
Contributor Author

lvijnck commented Oct 25, 2021

@AntonyMilneQB Hi, thanks for your response! I think your point is fair. However, I feel it does not make sense to require users to create a custom class for functionality as basic as this. I'm adding this PR as I required the code for my current project. Feel free to decline the PR if it overly segregates the interface/class.

@lvijnck
Contributor Author

lvijnck commented Nov 8, 2021

Opening the PR to spark discussion. Eager to finish this if the goal is to bring it in.

@lvijnck lvijnck marked this pull request as ready for review November 8, 2021 16:41
@lvijnck lvijnck requested a review from idanov as a code owner November 8, 2021 16:41
Contributor

@lorenabalan lorenabalan left a comment

Thank you for creating this PR @lvijnck. Being able to read with a schema is a valuable addition to the dataset. I have some questions around whether it is definitely always JSON, or whether we're coding ourselves into a corner by limiting it that way.

@antonymilne
Contributor

antonymilne commented Nov 17, 2021

Let me add some thoughts following discussion with @jiriklein a couple of weeks ago. Overall his biggest concern was about how to handle the schema filesystem so that it worked on dbfs and other storage solutions. If there's some way of achieving this then I think we're good to add this.

#887 did something similar by creating a new filepath argument for a SQL query and a new fs_args argument so that filepath can use all supported storage solutions. It's obviously trickier here because SparkDataSet already has filepath, which is subject to various bits of custom logic for different filesystems.

Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense e.g. on dbfs?

If we can make this simplification then that's great because we can just reuse the same logic to process schema_path as we do for filepath. If we can't make the simplification and actually need to maintain two separate bits of logic independently for filepath and schema_path then that is trickier. Essentially you're injecting a JSONDataSet into the definition of a different dataset type in the catalog. We don't have a great mechanism for doing that, but maybe actually using JSONDataSet to do all the filepath etc. processing would help simplify things. I'm thinking something like this:

dataset_with_schema:
  type: spark.SparkDataSet
  file_format: json
  filepath: /path/to/*.json
  load_args:
    schema_args: 
      filepath: s3://path/to/schema.json
      credentials: blah

Then in the SparkDataSet you can just use JSONDataSet(**schema_args), which handles all the file system logic itself. If the schema can be more general than JSON then in theory that's also possible by passing a dataset_type here and then using load_obj, but it's not ideal.
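
A hedged sketch of that idea (JSONDataSet and StructType.fromJson are existing APIs; the path is illustrative):

from kedro.extras.datasets.json import JSONDataSet
from pyspark.sql.types import StructType

schema_args = {"filepath": "s3://path/to/schema.json"}  # plus credentials/fs_args as needed
# JSONDataSet handles the fsspec filesystem work; the loaded dict is then
# converted into a Spark StructType
schema = StructType.fromJson(JSONDataSet(**schema_args).load())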

Also @lorenabalan: Jiri said that schema isn't relevant when saving a dataset, just when loading. So arguably it's best to keep it in load_args, but equally there are no potential clashes with save schemas if it becomes too unwieldy and is moved to a top-level key.

@antonymilne
Contributor

Another thought: instead of embedding the JSONDataSet definition in the SparkDataSet definition as above (kind of in the style of PartitionedDataSet), what about just referring to another dataset in the definition?

dataset_with_schema:
  type: spark.SparkDataSet
  file_format: json
  filepath: /path/to/*.json
  load_args:
    schema_dataset: json_schema

json_schema:
  filepath: s3://path/to/schema.json
  credentials: blah

Advantages: much less nesting/simpler; fully general, as you can immediately use any sort of dataset type for the schema (if it is indeed possible to use something other than JSON). Disadvantages: having a catalog entry that depends on another like this is completely different from how we normally do things in kedro; no idea how you would actually implement it, given you've coupled two datasets together and so need the catalog available in SparkDataSet to do catalog.load(schema_dataset).

Overall I'm pretty sure this is a very bad idea, but interested in seeing what others think.

@datajoely
Contributor

> Overall I'm pretty sure this is a very bad idea, but interested in seeing what others think.

It's an interesting thought but I'm inclined to agree. We don't want to build a dependency tree!

All in all I think this should follow the same pattern as the sql file addition discussed.

@antonymilne
Contributor

> It's an interesting thought but I'm inclined to agree. We don't want to build a dependency tree!
>
> All in all I think this should follow the same pattern as the sql file addition discussed.

How would you deal with the schema file on different filesystems in this case? From above: Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense e.g. on dbfs?

@datajoely
Contributor

> How would you deal with the schema file on different filesystems in this case? From above: Would it be a reasonable simplification to make the schema live on the same filesystem as the dataset? Or does that not make sense e.g. on dbfs?

I think fsspec is more than 'good enough' and we could revisit this if people start asking for more.

@lvijnck lvijnck marked this pull request as draft December 2, 2021 19:38
@SajidAlamQB
Contributor

@lvijnck Are you still wanting to complete this PR?

@lvijnck
Contributor Author

lvijnck commented Feb 7, 2022

> @lvijnck Are you still wanting to complete this PR?

Yeah, aiming to complete it sometime this week.

@lvijnck lvijnck marked this pull request as ready for review February 27, 2022 19:52
@lvijnck
Contributor Author

lvijnck commented Feb 27, 2022

@lorenabalan Sorry for the slight delay in finalizing this; I've just tackled the items we discussed live. There are 2 minor comments still open, LMK what you think.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
Contributor

@lorenabalan lorenabalan left a comment

So sorry for the late review! It's 90% there, just some testing improvements and it'll be ready to merge. 🚀

@lorenabalan
Contributor

> @lorenabalan There are 2 minor comments still open, LMK what you think.

I found the DataSetError one, but what's the second question?

@lvijnck
Contributor Author

lvijnck commented Mar 13, 2022

> > @lorenabalan There are 2 minor comments still open, LMK what you think.
>
> I found the DataSetError one, but what's the second question?

@lorenabalan my bad, it was concerning the credentials used to read the schema. Do you think we should also add a test for that? AFAIK I'm directly handing off the credentials to the FS.
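
For illustration, a hedged sketch of what "handing off the credentials to the FS" amounts to in fsspec terms (the protocol and credential keys are illustrative; s3fs, for example, accepts key/secret):

import fsspec

credentials = {"key": "<access-key>", "secret": "<secret-key>"}  # illustrative values
# The credentials dict is passed straight through as filesystem kwargs
fs = fsspec.filesystem("s3", **credentials)
with fs.open("path/to/schema.json") as f:
    raw_schema = f.read()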

@lvijnck lvijnck requested a review from lorenabalan March 13, 2022 10:24
@merelcht merelcht requested a review from antonymilne March 21, 2022 15:43
SajidAlamQB and others added 14 commits April 7, 2022 15:02
* Bump up version to 0.17.7

* Update CITATION.cff

* changes based on review

* …o-org#1302)
@lvijnck lvijnck force-pushed the feat/extras_spark_schema_support branch from 540b419 to 6ea2302 Compare April 7, 2022 14:06
@lvijnck lvijnck requested a review from yetudada as a code owner April 7, 2022 14:06
merelcht and others added 4 commits April 7, 2022 16:17
Contributor

@antonymilne antonymilne left a comment

This is great, thanks so much for all your work getting it over the line!

@merelcht
Member

You can ignore the failing tests for test_incremental_dataset and test_partitioned_dataset, but we do need to reach 100% coverage, and currently kedro/extras/datasets/spark/spark_dataset.py line 364 isn't covered (see the report here).

@merelcht merelcht merged commit 340a212 into kedro-org:main Apr 19, 2022
@datajoely
Contributor

That was an ordeal! Well done @lvijnck!
