Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at making RemoteKernelManager independent #810

Merged
merged 1 commit into from
May 20, 2020

Conversation

golf-player
Copy link

This is a WiP.

So first pass, I just wanted to get things working, and they do (with nbclient, not sure if gatewayapp still works, but it should). There's way too much asking for permission in the code right now, and there's probably a lot that could be done to separate code where mappingkernelmanagers are needed. Another option is to make a single kernel mapping manager (AKA a completely useless stupid class) just for convenience.

What do you think about doing traitlets like this? To properly share it with EG App, I'll need to add this mixin and remove the included traitlets from EG App.

@kevin-bates
Copy link
Member

Thanks for submitting the pull request.

I'd rather not take the trailet approach as this will break existing configurations. Is there an issue with a property/accessor approach?

There's way too much asking for permission in the code right now

Can you elaborate on this? Are you referring to if statements?

Another option is to make a single kernel mapping manager (AKA a completely useless stupid class) just for convenience.

You're welcome to use something like this in your wrapper code, but probably doesn't belong in EG. So the idea would be to set a placeholder for parent and parent.parent items?

I think the kernelspec issue you brought up on #803 is something we need to address before proceeding.

@golf-player
Copy link
Author

I guess the only problem with the property approach is that it's a lot of repeated code. How does using traitlets like this break configuration? I'm not too familiar with traitlets.

Can you elaborate on this? Are you referring to if statements?
Yeah basically there's too many points where we're checking if a property exists before doing something, when IMO there should be a clear divide between things that use a MultiKernelManager (MKM) and not (of course, this may be less useful if my use case isn't a common one). Or maybe that code belongs in the kernel_manager class.

So the idea would be to set a placeholder for parent and parent.parent items?

The idea there would be to continue as is, removing parent and parent.parent items. parent is an MKM, so making something with the same interface, which basically is no-op on things like kernel connection count increments and then setting that as a property on RKM, which in this pull request is RKM.multi_kernel_manager. that code could live anywhere, true.

I think the kernelspec issue you brought up on #803 is something we need to address before proceeding.

The problem being that the processproxy RKM uses is ignoring config from the kernelspec, right?

@golf-player
Copy link
Author

One thing I don't like is calling out to uuid.uuid4, but I see that done in several other places as well. Do you think it would make sense to make a function for everything to use, or keep it here, since technically the way RKM sets kernel_id should be agnostic of everything else.

@kevin-bates
Copy link
Member

Do you mind moving the conversation back to the issue? I'm getting scattered with a bunch of other stuff going on and I'm kinda wondering if we should try to jump on a web session to get on the same page here. Would you be available for something like that? If so, please drop me a note on my github email and I'll forward you a link.

@golf-player
Copy link
Author

I was going to ask about switching back to the issue haha.

I'll send you an email

@golf-player
Copy link
Author

TODO: Fix tests, add tests, squash into nice commit

@golf-player
Copy link
Author

golf-player commented Apr 28, 2020

OK got it to work with mixins. A tad hackish maybe.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@golf-player - these changes look really good. I haven't taken things for a spin, but plan to once updates are completed. I really like the way this cleans up the readability despite the non-intuitive approach taken to accomplish that. Thank you.

@@ -56,7 +54,7 @@
})


class EnterpriseGatewayApp(JupyterApp):
class EnterpriseGatewayApp(JupyterApp, EnterpriseGatewayConfigMixin):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that mixin-kinds of classes should be declared prior to the true superclasses.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.



class RemoteKernelManager(IOLoopKernelManager):
class RemoteKernelManager(IOLoopKernelManager, EnterpriseGatewayConfigMixin):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as previous - declaration order.

@@ -224,6 +233,27 @@ def __init__(self, **kwargs):
if hasattr(self, "cache_ports"):
self.cache_ports = False

if not self.connection_file and not self.kernel_id:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe self.kernel_id will always be None since this is the initializer and we set it to None above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I'll remove the line setting it to None since it doesn't really help.

@@ -423,3 +455,24 @@ def _get_process_proxy(self):
format(self.kernel_spec.display_name, process_proxy_class_name))
process_proxy_class = import_item(process_proxy_class_name)
self.process_proxy = process_proxy_class(kernel_manager=self, proxy_config=process_proxy_cfg.get('config'))

@property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a comment block somewhere (thinking prior to these properties) that talks about when self.parent would not exist. This might warrant an entry in the docs as well - although I don't mind keeping this on the down-low for the time being. 😉

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Is there somewhere specific in the docs where something like this might go?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay deferring this for now, although if we were to add it, the following locations might be worthy of updates:

I'm trying to think of what we'd call the topic. The term "Library Mode" popped into my head just now. I'm not enamored with that, but I'd like the topic to convey a more direct capability that utilizes the heart of EG. "Integration Mode" - something that allows third-parties to use RMK directly. Ideas welcome!

The topic should include an example - much like the one you opened the issue with - based on the final results. (Btw, I'd like to have those details anyway so I could try things out against my Hadoop YARN cluster.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like library mode. Just anything that conveys the meaning that this is JEG but without the gateway webapp.

Standalone Remote Kernel Execution if verbosity isn't an issue.

The example actually becomes really simple.

import nbformat
from nbclient import NotebookClient
from enterprise_gateway.services.kernels.remotemanager import RemoteKernelManager

with open("/home/ish/notebooks/Untitled.ipynb") as fp:
    test_notebook = nbformat.read(fp, as_version=4)

client = NotebookClient(nb=test_notebook, kernel_manager_class=RemoteKernelManager)
client.execute()

If you'd like a full docker with kernelspecs and config, I can do that, but I don't have experience with Yarn so it may take a while.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like "Standalone Remote Kernel Execution" as the topic. Perhaps in the opening statement, we can toss in an "a.k.a., Library Mode". 😄 This way we have a way to quickly identify the context when communicating this functionality or discussing it in general.

The sample is excellent - I'm really happy with how this is turning out. Let's not worry about a docker image for the moment. Given the example, I can run this against my YARN cluster, although probably not today (sorry).

@kevin-bates
Copy link
Member

Hi - I finally got around to trying this, and I'm not able to make it off the launch pad. Using nbclient 0.1.0 (0.2.0 switches to async kernel management, which EG isn't able to move to yet due to notebook), I get the following because self.kernel_spec_manager is None.

python -m indieRMK.py 
Traceback (most recent call last):
  File "/opt/anaconda2/envs/py3/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/anaconda2/envs/py3/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/home/gateway/indieRMK.py", line 9, in <module>
    client.execute()
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/nbclient/client.py", line 357, in execute
    with self.setup_kernel(**kwargs):
  File "/opt/anaconda2/envs/py3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/nbclient/client.py", line 339, in setup_kernel
    self.start_new_kernel_client(**kwargs)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/nbclient/client.py", line 312, in start_new_kernel_client
    self.km.start_kernel(extra_arguments=self.extra_arguments, **kwargs)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 274, in start_kernel
    self._get_process_proxy()
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 458, in _get_process_proxy
    process_proxy_cfg = get_process_proxy_config(self.kernel_spec)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/jupyter_client/manager.py", line 84, in kernel_spec
    self._kernel_spec = self.kernel_spec_manager.get_kernel_spec(self.kernel_name)
AttributeError: 'NoneType' object has no attribute 'get_kernel_spec'

Are you using a config file of any sorts to set that up? I was able to run EG fine with this build, but get this in 'library mode'. My driver is essentially identical to your example code.

Also, I think we're going to want some form of documentation for this. We're you still planning on adding that?

@golf-player
Copy link
Author

@kevin-bates apologies for getting back to you so late. Got wrapped up in another project. I will have time this week to debug this issue. I'm guessing it's the nbclient version difference; I'm using the latest nbclient (which works I guess surprisingly, given JEG isn't async)

What's the timeline for the JEG deploy btw? I may be able to reprioritize things depending in order to have this ready by then if possible.

I will be writing documentation for this.

@kevin-bates
Copy link
Member

Hi @golf-player - thanks for the update. Regarding the timeline, as soon as Notebook 6.1 is available, we will merge #794 and start building the next EG release. I'm hopeful that some of the front-end notebook issues will be reviewed/merged in the next week or two so that release can move forward.

I would love to get this into the next EG release, so, should things get to that point, I don't mind waiting to ensure this is ready, although I don't expect that to be necessary.

@golf-player
Copy link
Author

golf-player commented May 15, 2020

Looks like nbclient 0.3.0 also just dropped. I've gotten both 0.3.0 and 0.1.0 to run the kernel, but there's another issue creeping up on both versions where they're not shutting down the kernels after. These may unfortunately need to be addressed on their end, but I'm still looking into it.

Will update the MR for now with the changes to just add a kernelspec_manager in the constructor if one isn't gotten from the parent (RMKM explicitly passes a kernelspec_manager)

@kevin-bates
Copy link
Member

@golf-player - thanks for the updates. I took this for another spin against my YARN cluster and got the following...

Traceback (most recent call last):
  File "/opt/anaconda2/envs/py3/lib/python3.6/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/anaconda2/envs/py3/lib/python3.6/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/home/gateway/indieRMK.py", line 9, in <module>
    client.execute()
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/nbclient/client.py", line 357, in execute
    with self.setup_kernel(**kwargs):
  File "/opt/anaconda2/envs/py3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/nbclient/client.py", line 339, in setup_kernel
    self.start_new_kernel_client(**kwargs)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/nbclient/client.py", line 312, in start_new_kernel_client
    self.km.start_kernel(extra_arguments=self.extra_arguments, **kwargs)
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 279, in start_kernel
    self._get_process_proxy()
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 468, in _get_process_proxy
    self.process_proxy = process_proxy_class(kernel_manager=self, proxy_config=process_proxy_cfg.get('config'))
  File "/opt/anaconda2/envs/py3/lib/python3.6/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 44, in __init__
    kernel_manager.parent.parent.yarn_endpoint)
AttributeError: 'NoneType' object has no attribute 'parent'

I went ahead and added a few more entries to dependent_props and was then able to run a notebook using nbclient. I had nbclient 0.1.0 installed and noticed the kernel had not been terminated. However, after updating to 0.2.0, I found the kernel terminated, but with a state of KILLED...
image
when "graceful" termination should produce a state of FINISHED. This implies that the something isn't quite right with the shutdown sequence. You might not see this using k8s, but it would be good to understand the why the message-based termination is not occurring.

In the YARN application log, we should see a shutdown indicator like the last line here. Instead, we see all but that line.

Log Type: stdout
Log Upload Time: Sat May 16 07:20:33 -0700 2020
Log Length: 1772

[D 2020-05-16 07:17:44,825.825 launch_ipykernel] Using connection file '/tmp/kernel-6e7f4f81-1ffc-4c09-8a3b-0920f7fd2272_I_lTUb.json'.
[I 2020-05-16 07:17:44,826.826 launch_ipykernel] Signal socket bound to host: 0.0.0.0, port: 50501
[D 2020-05-16 07:17:44,830.830 launch_ipykernel] JSON Payload '{"stdin_port": 44132, "pgid": "3249", "ip": "0.0.0.0", "pid": "3299", "control_port": 47874, "hb_port": 40748, "signature_scheme": "hmac-sha256", "key": "cfc3a76f-964f-4f87-9bc8-1cbaaf8dd39f", "comm_port": 50501, "kernel_name": "", "shell_port": 53084, "transport": "tcp", "iopub_port": 39304}
[D 2020-05-16 07:17:44,836.836 launch_ipykernel] Encrypted Payload 'u5T6rmMK+4S6e659vVYcSooB9AA+yQhmrIXbTXBaMBdNjVGV3KaNTi1wHvxxLHZ/4J4xnw3j8ARbyEx0qEJ53LRUsphakrvh+IG2UtRloSgR9oBO+tUKf9LfXC0XHzorxuEdTr/byIEcQ6ChHgqqkxT4qnzum/TASyf8+MV3LBxW5M6gteoeTFeibvqD4hKa6UQm/A7+QT+aabxP5/flD1hPXPA+2AbDwSLyoLQpu+wHagihjEH6BqcH9YfuZ7ihErRQp4xb0ignUZRu7xvdvpg+KntBzq/rWvnlmLP+t9SUnc2/72pEJ6cYxat3SLpfg8ZsHUzhBlk8lJVlo9l0j6dqX7t+jDNzcUEwj/bL2Mk0E0RlQseb5IqDWcKYzjJjpYaooA8aOieLM3+zm6t8hg==

[I 2020-05-16 07:20:33,493.493 launch_ipykernel] gateway_listener got request: {u'shutdown': 1}

I've pushed the parent.parent removal to my repo - feel free to grab them or I can push to yours. https://github.com/kevin-bates/enterprise_gateway/tree/remove-parent-parent

@kevin-bates
Copy link
Member

kevin-bates commented May 18, 2020

Looking at nbclient 0.2.0, if I add this line following L335

await ensure_async(self.km.shutdown_kernel(now=False))

I get a graceful shutdown with a FINISHED status.

It seems like nbclient should also allow the kernel-manager to perform a graceful shutdown prior to forcing it.

@golf-player
Copy link
Author

golf-player commented May 19, 2020

Sorry about that, I didn't go through the other processproxy subclasses. I kind of assumed the baseprocessproxy would be enough

Seems like the problem here is that graceful shutdown (kernel client shutdown) isn't shutting down the process.

Looking into it.

@kevin-bates
Copy link
Member

The process is shutdown but nbclient isn't giving the kernel manager an initial chance to do it gracefully. This is a difference in behavior from what notebook does.

@golf-player
Copy link
Author

golf-player commented May 19, 2020

https://github.com/jupyter/nbclient/blob/0.2.0/nbclient/client.py#L334-335

Here the client is given the chance to shutdown gracefully rather than the manager (this appears to do the same thing as the manager's graceful shutdown)

@kevin-bates
Copy link
Member

Right. This was my very comment from earlier today: #810 (comment)

@golf-player
Copy link
Author

Sorry I'm a bit confused here. Is self.kc.shutdown not the correct way to shut down a kernel?

@kevin-bates
Copy link
Member

That sends the shutdown message to the kernel, but we leverage the kernel manager for interacting with the hosting resource manager and the shutdown method on the kernel manager includes both graceful and immediate options.

For the graceful route we send a comparable shutdown request to the listener but for immediate we enlist the help of the resource manager to do whatever is necessary. On YARN that equates to calling their KILL API.

@golf-player
Copy link
Author

So in the case of a docker container running a kernel process, the kernel gets the shutdown message, and at that point the process should exit 0, stopping the running container.

I guess what's happening here is that the kill request happens while it's shutting down. I'm still trying to see that that's what's happening in the logs

@golf-player
Copy link
Author

Yeah that's what's happening; the shutdown message is received, then the pod gets killed.

@kevin-bates
Copy link
Member

We have the listener as well. It's a sibling to the kernel to handle things like interrupts. I haven't looked at the container process proxies lately but they probably terminate the container when the immediate route is taken. This is probably ok in container land but not in YARN where the status is killed rather than finished.

@golf-player
Copy link
Author

i think this warrants a discussion of its own with the nbclient devs. nbclient 0.3.0 I believe also does a force shutdown

This is probably ok in container land but not in YARN where the status is killed rather than finished.

Agreed. distributed mode as well. And even in container land, if it's being KILLed, it's possible some things don't stop gracefully, like shutdown hooks

@golf-player
Copy link
Author

golf-player commented May 19, 2020

jupyter/nbclient#60 for that.

These changes seem to be from the nbclient side. Documentation, IMO, should only be merged after nbclient supports clean shutdowns. The rest can go in after I fix the tests. What do you think?

@kevin-bates
Copy link
Member

Thanks @golf-player. I guess I'm okay with the docs. They're really just an example and the code won't change even after nbclient fixes things. At most I guess we'd update the nbclient version in the example, but since there's already 0.3.0, even that isn't a big deal.

I guess one of us should submit a PR to nbclient if we don't hear back from them soon.

Aside from that, how are we doing here? Was the conductor endpoint commit the test-fix you were referring to?

@golf-player
Copy link
Author

Well as of nbclient 0.3.1, it works at least as well as 0.2.0, so that's good.

I'm working on a rough implementation to get some conversation going with them.

I think this PR is looking done more or less. Yeah it was a minor config change breaking tests, but no issues now.

@kevin-bates
Copy link
Member

Sounds good. Thanks for opening the nbclient PR.

I just ran this both as EG and Independent RMK (like Independent George) and things look great. I plan on performing a squash and merge first thing tomorrow. If you'd prefer to squash to get the comment as you'd like please feel free, otherwise, I'll likely leave the meaningful commit comments in place or derive one myself.

I think is going to be really useful! Thank you.

@golf-player
Copy link
Author

I've pushed a squashed commit.

Thanks so much for all the help! Hopefully the nbclient PR goes well.

Make RemoteKernelManager (RKM) able to be run without a (grand)parent
EnterpriseGatewayApp (EGA) instance so it can be used by nbclient.

The config which RKM with EGA was being referred to using traitlet
lineage like `self.parent.parent.property_name`. Change both to inherit
from a Configurable mixin which contains all of EGA's previous config.
Link the attributes of the RKM instance with EGA if available to keep
old behaviour.

Modify other properties RKM uses that are not traits to become
`@property`s which fall back to sane defaults if running independently.

Change RKM to be able to generate a kernel id if necessary. Change
ProcessProxy to use provided kernel ids. Move kernel id generation
logic out of RemoteMappingKernelManager.

Resolves jupyter-server#803
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for this great pull request!

@kevin-bates kevin-bates merged commit b90d274 into jupyter-server:master May 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants