[KED-2004] Manage `hook_manager` lifecycle in session #1153
# set up all hooks so we can discover all pipelines
hook_manager = get_hook_manager()
_register_hooks(hook_manager, settings.HOOKS)
_register_hooks_setuptools(hook_manager, settings.DISABLE_HOOKS_FOR_PLUGINS)
Moved this logic to happen when a new session is instantiated instead.
My theory is that this is why the code fails when running with `ParallelRunner`. When we use `ParallelRunner`, we make sure each subprocess looks the same, even in `spawn` mode. That's why we call `_bootstrap_subprocess`, which configures a) the logging to be the same and b) the project, via `configure_project`. `configure_project` also ensured that the hook manager has all the right hooks set up. Now that we don't do this here anymore, I believe the code should be replicated in `_bootstrap_subprocess`.
Not entirely sure why the `pluggy` hook manager doesn't get pickled/unpickled to the same object.
Yes, this is exactly what's happening! The reason is that even though on first try it looked like the `PluginManager` could be pickled with the latest `pluggy` version, that wasn't actually the case.
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
cf8979a to ef49220
For transparency I'll leave here the same comment from our conversation:
TL;DR: I agree that maybe it's not the best time to make this change, and we can revisit at another time, but more for conceptual reasons rather than it actually being a sore problem.
I've spent a while thinking about this and I am also not sure about several things... So this isn't going to be a very useful review, but I'll just list my various questions and comments here.
Edit: looks like Lorena commented at exactly the same time as me. Agreed with everything she says 👍 Having one manager per session makes sense to me also, but I don't immediately see a nice way of doing it.
Thanks both for your thoughts on this! @AntonyMilneQB on this one:
the problem is that indeed the code that fetches the hook manager (e.g.
My 2 cents is this is the conceptually correct thing to do. There used to be 2 kinds of hooks: registration & life-cycle. Managing them using the same hook managers was a mistake. Now that registration hooks are gone, scoping the life-cycle hook manager to a session makes sense because:
@limdauto what you say definitely makes sense. Given that
@AntonyMilneQB: Step 1: Add
@limdauto makes sense, but that would mean passing
that sounds a bit strange to me, but I just might be misunderstanding what should have access to what. 🤔
I think the long-term solution is to create the Nevertheless, I did a quick search here in GitHub where So if we were to map out all the actors here, we have:
One thing we can notice here is that the lifecycles of both the No other actors should be involved in this, unless they are also clients of the object. Eliminating the initialiser and tying the lifetime of the hook manager with the one of All of this is subject to I think we should do these changes now in order not to kick the can down the road to 0.19, since that'd be breaking changes. FOLLOW UP:
@idanov very well written. Just want to point out that we already have 2 pluggy manager instances: one for CLI hooks and one for lifecycle hooks, so all this is completely possible.
… + runner methods Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
kedro/runner/parallel_runner.py
hook_manager = create_hook_manager()
_register_hooks(hook_manager, settings.HOOKS)
_register_hooks_setuptools(hook_manager, settings.DISABLE_HOOKS_FOR_PLUGINS)
Create a new `hook_manager` when doing multiprocessing, because the `PluginManager` can't be serialised.
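The idea of rebuilding the manager inside each worker can be sketched as follows. This is a minimal illustration under stated assumptions, not Kedro's implementation: `StandInManager`, `LoggingHooks`, `HOOK_REGISTRY` and `run_node_in_worker` are all hypothetical names, and the registry stands in for whatever project settings the worker can read. Only plain, picklable data (hook names) crosses the process boundary; the manager itself is constructed on the worker side.

```python
import pickle


class StandInManager:
    """Hypothetical stand-in for a hook manager; NOT pluggy's PluginManager."""

    def __init__(self):
        self._hooks = []

    def register(self, hook):
        self._hooks.append(hook)

    def call(self, event, **kwargs):
        # Call the method named `event` on every registered hook that has it.
        return [
            getattr(hook, event)(**kwargs)
            for hook in self._hooks
            if hasattr(hook, event)
        ]


class LoggingHooks:
    """Hypothetical hook implementation used only for this sketch."""

    def before_node_run(self, node_name):
        return f"about to run {node_name}"


# Assumption: a lookup table standing in for project settings that every
# worker process can load for itself.
HOOK_REGISTRY = {"logging": LoggingHooks}


def run_node_in_worker(node_name, hook_names):
    """Worker entry point: build a fresh manager instead of receiving one."""
    manager = StandInManager()
    for name in hook_names:
        manager.register(HOOK_REGISTRY[name]())
    return manager.call("before_node_run", node_name=node_name)


# Simulate the process boundary: only the list of names is serialised.
hook_names = pickle.loads(pickle.dumps(["logging"]))
print(run_node_in_worker("split_data", hook_names))
```

The sketch calls `run_node_in_worker` directly; in a real parallel run it would be the function submitted to the process pool.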
> `PluginManager` can't be serialised

Is that why we need it in `fork` mode as well? Or can we move it up under the `if` branch?
We should also update the docstrings; they still mention "activating the session", which we don't do anymore.
Yes, the problems from `PluginManager` not being serialisable happen in all modes, because we try to do the serialisation further up the stack. I tested this by creating a `PluginManager` and calling `pickle.dump()` on it and that fails.
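A check of this kind takes only a few lines. The sketch below does not use pluggy: `StandInManager` is a hypothetical class, and the assumption that it holds a lambda is purely for illustration (the thread does not say which part of `PluginManager` blocks pickling). The helper `is_picklable` is likewise a made-up name.

```python
import pickle


class StandInManager:
    """Hypothetical stand-in for a hook manager; NOT pluggy's PluginManager."""

    def __init__(self):
        self.plugins = []
        # Assumption for illustration: the manager keeps a callable that
        # pickle cannot serialise (a lambda is a common example).
        self._call = lambda hook_name, **kwargs: None


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Plain configuration data survives pickling; the manager object does not.
print(is_picklable({"hooks": ["my_project.hooks.ProjectHooks"]}))
print(is_picklable(StandInManager()))
```

Running the same helper against a real `PluginManager` instance would be the way to confirm the behaviour for a given pluggy version.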
And yes good point on the docstring!
"_get_pipelines_registry_callable",
return_value=mock_get_pipelines_registry_callable,
)
return mock_get_pipelines_registry_callable()
I did some cleanup here, because this fixture already exists in `conftest.py`.
LOVE this! ❤️ 😍 Fantastic job!! 🔥 👏 👏 👏
Don't forget to add a few lines in the release notes about the breaking changes, like different signatures, public API, and the fact that the hook manager is no longer global, but unique per session.
Amazing work. Love the fact that we are now on pluggy 1 as well 🎉
Amazing work, especially getting it working with the parallel runner!!
Generally looks 🌟 but I just have a few questions which might be best answered by @idanov actually:
- Why do we actually need `_clear_hook_manager` at all? Can't we get rid of it entirely? Since the hook manager is now contained within a session, I don't see why we would need to clean anything up, so we could simplify this even more. If I understand correctly, this is what Ivan meant when he previously said "Eliminating the initialiser and tying the lifetime of the hook manager with the one of KedroSession will hopefully be sufficient to remove the need to clear the hooks on session closure or IPython reloading - who cares if those hooks are registered in an object which is no longer being used ever?"
- The current clean-up strategy seems inconsistent in ipython, since `session.close()` doesn't get called there. Either we care about clearing the hook manager or we don't, but I think we want the same behaviour within a `kedro run` and `kedro ipython`? If, as I suspect, we don't in fact need `_clear_hook_manager` at all, then it's fine that we don't do any hook manager clean-up in ipython as you currently have it, but I wonder if it's still worth putting in `session.close()` just for consistency (since this also calls `_deactivate_session` and potentially saves to the session store).
- If in the future we do this to pass the session to hooks, does it mean passing `session` to `Runner` as well? Adding both a hook manager and session arguments to the runner somehow feels a bit bloated to me, especially given these arguments get cascaded down to `run_node` etc. I'm perfectly happy with the changes made here and have no better method to propose, but just wondering where this might take us in the future.
Yes, I think you're right, I forgot about Ivan's comment. I will remove this.
Where would you call
I think we need to discuss and design the proposal of passing the session to hooks in more detail. I don't really like the idea of passing the
…edro into KED-2004-hook-manager
Oh yes, of course. I was thinking because
⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐ 🌟 ⭐
Description
Back in August 2020 it was discovered that the global `hook_manager` could have out-of-date hooks. To resolve this, `_clear_hook_manager` was added to the tests and the `ipython` workflow. It was also suggested at the time that when the `KedroSession` was finished it might be a good idea to have a `hook_manager` per session.
Development notes
Changes I made:
- Hooks are no longer registered in the `configure_project` call (which was called from e.g. `KedroSession.create()`, `bootstrap_project`); this now happens when a new `KedroSession` is instantiated.
- When the `KedroSession` is closed the hook manager gets cleared.
- `KedroSession` passes `self._hook_manager` to `KedroContext` and `Runner.run()`.
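The ownership model in the list above can be sketched with stand-ins. None of the names below are Kedro's: `StandInManager`, `SketchSession`, `SketchRunner` and `AuditHooks` are hypothetical, and the sketch only shows the shape of the change, namely that each session builds its own manager and hands it to the runner explicitly instead of both reaching for a global.

```python
class StandInManager:
    """Hypothetical stand-in for a hook manager; NOT pluggy's PluginManager."""

    def __init__(self):
        self._hooks = []

    def register(self, hook):
        self._hooks.append(hook)

    def call(self, event, **kwargs):
        # Call the method named `event` on every registered hook that has it.
        return [
            getattr(hook, event)(**kwargs)
            for hook in self._hooks
            if hasattr(hook, event)
        ]


class SketchRunner:
    """Hypothetical runner that receives the manager as an argument."""

    def run(self, node_names, hook_manager):
        events = []
        for name in node_names:
            events.extend(hook_manager.call("before_node_run", node_name=name))
        return events


class SketchSession:
    """Hypothetical session that owns exactly one hook manager."""

    def __init__(self, hooks=()):
        # The manager is created with the session, not at import time.
        self._hook_manager = StandInManager()
        for hook in hooks:
            self._hook_manager.register(hook)

    def run(self, runner, node_names):
        return runner.run(node_names, self._hook_manager)


class AuditHooks:
    """Hypothetical hook implementation used only for this sketch."""

    def before_node_run(self, node_name):
        return f"audit: {node_name}"


with_hooks = SketchSession(hooks=[AuditHooks()])
without_hooks = SketchSession()

# Hooks registered on one session are invisible to the other.
print(with_hooks.run(SketchRunner(), ["split_data"]))
print(without_hooks.run(SketchRunner(), ["split_data"]))
```

Because the manager's lifetime matches the session's, a hook registered for one session cannot go stale inside another.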
Problem with this implementation
This implementation works fine, apart from the case where you use the `ParallelRunner` and have a plugin installed with hook implementations. If you just create custom hooks inside your project it does work, but not when these hooks come from an installed plugin. And the plugin hooks work fine if you don't use the `ParallelRunner`. I've added a test that uncovers this issue.
The reason this happens is that the `PluginManager` isn't serialisable. It will be set to `None` when it gets pickled, which causes issues down the line.
👩🏼‍🔧 Fix: The solution to the above issue is to create a new `hook_manager` instance when using multiprocessing in the `ParallelRunner`.
Checklist
`RELEASE.md` file