Initialise PySpark using hooks rather than custom context #1563

Closed
antonymilne opened this issue May 20, 2022 · 5 comments
@antonymilne
Contributor

antonymilne commented May 20, 2022

Spun out of #506 (comment).

Currently PySpark is initialised in Kedro using a custom context. We now have a much better place to do this: the after_context_created hook, defined in hooks.py. This would look something like this:

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in project's conf folder.
        """

        # Load the spark configuration in spark.yml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session. The package name is read from the
        # context, since the hook class itself has no _package_name attribute.
        spark_session_conf = (
            SparkSession.builder.appName(context._package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")

In addition to creating the hooks.py file, you should remove the context.py file and edit settings.py so that it instantiates SparkHooks in HOOKS and no longer provides a custom context.
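
A minimal sketch of what the settings.py change might look like. The package name my_project is hypothetical and stands in for the project's own package; the rest assumes the standard HOOKS and CONTEXT_CLASS settings that Kedro reads from this file.

```python
# src/my_project/settings.py
# "my_project" is a hypothetical package name; substitute your own.
from my_project.hooks import SparkHooks

# Hooks are registered as instances, so SparkHooks is instantiated here.
HOOKS = (SparkHooks(),)

# With the hook in place, no CONTEXT_CLASS override is needed: leaving
# CONTEXT_CLASS unset makes Kedro fall back to its default KedroContext.
```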

We need to change this in:

  1. Both pyspark and pandas-pyspark starters
  2. Documentation

Maybe in future SparkHooks will live somewhere within the kedro package so that users can just do from kedro.extras.hooks.pyspark import SparkHooks and not need to define the hook themselves.

Related/possible alternative? #904

@antonymilne
Contributor Author

@Galileo-Galilei comment:

Lim's comment ("There used to be 2 kinds of hooks: registration & life-cycle. Managing them using the same hook managers was a mistake") and the difficulties you have locating this new hook are in line with #904 and the "engines" design pattern: registering objects is not the same as calling them during the session lifecycle. kedro.extras is likely the right place for now, but the design will likely need some more thought later.

@datajoely
Contributor

I've been thinking it may be nice to have some sort of AbstractExecutionContext that we could use for Spark, Dask, Snowpark and others ...
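
To make the idea concrete, here is a rough sketch of what such an abstraction could look like. Nothing like this exists in Kedro: the method names (_create_handle, get_or_create) and the FakeExecutionContext subclass are invented for illustration, and a real Spark, Dask or Snowpark subclass would build its own session or client in _create_handle.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class AbstractExecutionContext(ABC):
    """Hypothetical base class for engine sessions (Spark, Dask, ...)."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = dict(config)
        self._handle: Optional[Any] = None

    @abstractmethod
    def _create_handle(self) -> Any:
        """Build the engine-specific session or client."""

    def get_or_create(self) -> Any:
        # Create the handle once and reuse it on later calls.
        if self._handle is None:
            self._handle = self._create_handle()
        return self._handle


class FakeExecutionContext(AbstractExecutionContext):
    """Stand-in engine so the sketch runs without Spark installed."""

    def _create_handle(self) -> Any:
        return {"engine": "fake", "options": self.config}


execution_context = FakeExecutionContext({"workers": 2})
handle = execution_context.get_or_create()
```

An after_context_created hook could then call get_or_create() on whichever subclass the project uses, which keeps the engine-specific code out of the hook itself.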

@antonymilne
Contributor Author

@datajoely could those also be put into an after_context_created hook? I'm not sure what the code for those would look like, so I don't know exactly which Kedro objects would be required to instantiate them.

E.g. all that the Spark instantiation requires access to is context.config_loader, and I'm not sure we really need even that (does anyone actually want to change spark.yml for different environments? Would be good to know).
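
As an illustration of how narrow that dependency is, the configuration step of the hook can be exercised with stand-in objects and no running Spark. StubConfigLoader and load_spark_options are hypothetical names made up for this sketch; only the config_loader.get("spark*", "spark*/**") call is taken from the hook above.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional, Tuple


class StubConfigLoader:
    """Hypothetical stand-in that mimics only config_loader.get(*patterns)."""

    def __init__(self, spark_options: Dict[str, Any]) -> None:
        self._spark_options = spark_options
        self.requested_patterns: Optional[Tuple[str, ...]] = None

    def get(self, *patterns: str) -> Dict[str, Any]:
        # Record the patterns so a test can check what was requested.
        self.requested_patterns = patterns
        return dict(self._spark_options)


def load_spark_options(context: Any) -> Dict[str, Any]:
    """Mirror the config-loading line of the hook, without starting Spark."""
    return context.config_loader.get("spark*", "spark*/**")


loader = StubConfigLoader({"spark.scheduler.mode": "FAIR"})
context = SimpleNamespace(config_loader=loader)
spark_options = load_spark_options(context)
```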

@datajoely
Contributor

I think you have to assume you would want local/dev/prod spark configuration environments. I think all of these can be migrated to this pattern; my push here is to make sure we think about remote execution targets in the abstract when developing this.

Personally - I'm very keen to build a Snowpark implementation, but will only do so once this stabilises.
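
The local/dev/prod point can be illustrated with a small sketch of layered configuration. This is not Kedro's implementation, only a plain dictionary merge that mimics the usual rule that values from the run environment take precedence over base; the function name and the option values are made up for the example.

```python
from typing import Any, Dict


def resolve_spark_config(
    base: Dict[str, Any], override: Dict[str, Any]
) -> Dict[str, Any]:
    """Return base options with environment-specific values layered on top."""
    merged = dict(base)
    merged.update(override)
    return merged


# Hypothetical contents of conf/base/spark.yml and conf/prod/spark.yml.
base_conf = {"spark.driver.maxResultSize": "3g", "spark.scheduler.mode": "FAIR"}
prod_conf = {"spark.driver.maxResultSize": "8g"}

resolved = resolve_spark_config(base_conf, prod_conf)
```

Because the hook reads its options through context.config_loader, an override of this kind would reach the SparkSession without any change to the hook code.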

@antonymilne antonymilne changed the title from "For discussion: make pyspark hooks instead of custom context" to "Initialise pyspark using hooks rather than custom context" Jul 11, 2022
@antonymilne antonymilne changed the title from "Initialise pyspark using hooks rather than custom context" to "Initialise PySpark using hooks rather than custom context" Jul 11, 2022
@merelcht merelcht moved this to To Do in Kedro Framework Aug 15, 2022
@jmholzer jmholzer moved this from To Do to In Progress in Kedro Framework Aug 22, 2022
@jmholzer jmholzer self-assigned this Aug 22, 2022
@jmholzer jmholzer removed their assignment Aug 22, 2022
@jmholzer jmholzer moved this from In Progress to To Do in Kedro Framework Aug 22, 2022
@SajidAlamQB SajidAlamQB self-assigned this Aug 30, 2022
@SajidAlamQB SajidAlamQB moved this from To Do to In Progress in Kedro Framework Sep 1, 2022
@SajidAlamQB SajidAlamQB moved this from In Progress to In Review in Kedro Framework Sep 2, 2022
@SajidAlamQB SajidAlamQB moved this from In Review to Done in Kedro Framework Sep 6, 2022
@merelcht
Member

merelcht commented Sep 6, 2022

Completed in kedro-org/kedro-starters#102
