Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parametrized session_id #1571

Closed
wants to merge 3 commits into from

Conversation

ARashitov
Copy link

@ARashitov ARashitov commented May 24, 2022

Description

Use case

I have web application where I use kedro inside of endpoints. This web application has centralized logs, but I'm struggling with tracing which kedro session is executed for which endpoint since application is horizontally scaled.

How I resolve this one

I add session_id argument to KedroSession.create() function and then i implemented kedro hooks for logging and linking with endpoints trace id

Development notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@ARashitov ARashitov requested a review from idanov as a code owner May 24, 2022 14:34
@ARashitov ARashitov marked this pull request as draft May 24, 2022 14:35
@datajoely
Copy link
Contributor

Related to this issue raised by @noklam #1551

@noklam
Copy link
Contributor

noklam commented May 24, 2022

I think there are 2 points to discuss here:

  1. Does save_version = session_id still valid? Or do we need something more? In the context of Should we allow users provide custom session_id  #1551, what user need is an ID for the entire DAG run instead of the individual session.
  2. I didn't expect kedro pipeline will be used as an endpoint. Currently we have 1 session = 1 run, any addition run would required recreating the entire KedroSession which may be kind of slow. Do we need some kind of function like KedroSession.reset() that simply reset the flag & timestamp?

@ARashitov
Copy link
Author

ARashitov commented May 24, 2022

@noklam Response in 2 discussion points:

1. Yes, it remains. session_id is gets passed to KedroSession.create() class method returning the instance of KedroSession with custom session_id which further gets used in expression save_version = session_id

Ensuring assumption of 1 session = 1 run

I'm attaching screenshot with example of mine implementation of KedroSession I use in Jupyter notebook for pipelines debugging purposes where i ensuring assumption of 1 session = 1 run.
Screen Shot 2022-05-24 at 22 16 22

  • note: You may notice extra parameter: extend_catalog in session.run(). It actually uses my customized KedroSession class with extra extend_catalog: Dict[str, AbstractDataSet] = None argument I use to dynamically populate DataCatalog instance with request body within session.run()

2. I'm happy with curreny performance and yes, i recreate entire session for each endpoint call. Later, my plan is to reallocate kedro runs to celery, but it requires some additional experiments from my side. Thanks, for your willingness to help

Appendix A | init_kedro_session()

def init_kedro_session(trace_id: str, env: str = ENVIRONMENT) -> KedroSession:
    """Function performs extraction of kedro context using built-in tools

    Args:
        env (str): input environment

    Returns:
        KedroContext: Generated kedro context
    """
    try:
        # 1. Bootstrapping project to find main path
        startup_path = Path.cwd()
        project_path = _find_kedro_project(startup_path)
        metadata = bootstrap_project(project_path)

        # 2. Initlize session & create context
        session = KedroSession.create(
            package_name=metadata.package_name,
            project_path=project_path,
            trace_id=trace_id,
            env=env,
        )
    except Exception as kedro_init_exc:
        message = f"failed to initilize kedro session: {kedro_init_exc}"
        raise get_http_exception(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            message=message,
        )

    return session

@merelcht
Copy link
Member

This is an interesting use case, and I think it's definitely worth discussing ways for users to control the session_id with the team. The KedroSession is still one of the more unstable components in Kedro, so I'd like to investigate more what potential use cases this would solve, before adding another argument to the user facing create method. One concern I have now is that we'd need to be careful about the format of session_id the user can provide. Currently, the session_id will be taken as the save_version as well and if this doesn't comply with the following rules, versioning will be broken:

a) be a case-insensitive string that conforms with operating system filename limitations, 
b) always return the latest version when sorted in lexicographical order.

Thanks for opening up this PR @atmosone, it might not be merged soon, but it's definitely food for thought.

@datajoely
Copy link
Contributor

One concern I have now is that we'd need to be careful about the format of session_id the user can provide. Currently, the session_id will be taken as the save_version as well and if this doesn't comply with the following rules, versioning will be broken:

a) be a case-insensitive string that conforms with operating system filename limitations, 
b) always return the latest version when sorted in lexicographical order.

Thanks for opening up this PR @atmosone, it might not be merged soon, but it's definitely food for thought.

On these constraints, I was thinking we could allow the user to specify a session_id suffix which is appended to the end, this would meet requirement (b) and we could validate to ensure (a) is met.

@ARashitov
Copy link
Author

@MerelTheisenQB @datajoely Thank you!
@MerelTheisenQB Where I could read more about requirements like the ones you sent & more about session versioning?

@noklam
Copy link
Contributor

noklam commented May 25, 2022

Defintely a topic that need more discussion. 🙂

@noklam
Copy link
Contributor

noklam commented May 25, 2022

@atmosone There are no specific documents, you have to look into source code for this. This is more about how KedroSession trigger a run and how versioning works.

@merelcht
Copy link
Member

@atmosone There are no specific documents, you have to look into source code for this. This is more about how KedroSession trigger a run and how versioning works.

@atmosone, Nok is right, we don't have any specific docs about this currently. The main reason for this is that we are still designing the KedroSession and the version we have right now will likely change in the future. We would advise users not to implement anything that heavily relies on what the KedroSession is now, unless of course you make your own version that can stand alone from Kedro. KedroSession is really more of a "back of house" component and as such not (yet) ready to be heavily used by users other than the framework developers.

@merelcht
Copy link
Member

Closing this PR as it's very out of date with our main branch now. We'll take the suggestions and ideas here into account when redesigning how Kedro Session works. I'd suggest to anyone who wants to continue the conversation to create an issue or join our Slack and chat there 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

blacken-docs missing python failure
4 participants