Kernel pod into AWS Fargate #792
I have edited the jupyter notebook variable as you discussed here and I'm able to launch the pod. But while the pod is pulling the image, jupyter notebook shows me a kernel error. It looks like JEG is not waiting for the pod to be set up. After pulling the image, my pod is ready, but I still have the kernel-error message and the kernel-dead icon.
Hi @lucabem - I'm sorry for the frustration about kernel-launch-timeout. I'd be inclined to make it an EG-only value (rather than coming from the client), but the problem is that it needs to fit within the request timeout (i.e., KERNEL_LAUNCH_TIMEOUT <= EG_REQUEST_TIMEOUT), which MUST be set in the client to keep the connection to EG open long enough. Your observation is correct: EG doesn't have any knowledge that it's dealing with k8s, docker, or YARN - it just polls for the kernel's readiness. In addition, auto-scaling can be problematic (as discussed in the link you provided) and that's something the KernelImagePuller doesn't really address very well.
Are you creating a new node per notebook server instance, or per notebook instance (i.e., per kernel)? In either case, may I ask why? If the new node is hosting a notebook server, and pods are then launched from that, I would recommend you use JupyterHub to launch the notebook server since it has richer support for node management - although I'm not sure it addresses this particular use case. I think your only recourse is to make KERNEL_LAUNCH_TIMEOUT sufficiently long to cover node creation, image pull, and kernel launch (see the sketch below). One bright note is that it looks like we might have asynchronous kernel startup in the Jupyter stack soon (next few weeks), so these increased start times will not impact the next kernel start request.
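
For illustration, here is a minimal client-side sketch of that relationship, assuming the notebook 6.x `GatewayClient` configurable `request_timeout` and a `jupyter_notebook_config.py`; the EG URL and the 600-second value are placeholders:

```python
# jupyter_notebook_config.py (sketch): keep the client-side request timeout
# at least as large as KERNEL_LAUNCH_TIMEOUT so the HTTP request to EG stays
# open while the node is created, the image is pulled, and the kernel starts.
import os

c = get_config()  # provided by the Jupyter config loader

launch_timeout = int(os.environ.get("KERNEL_LAUNCH_TIMEOUT", "600"))  # seconds

c.GatewayClient.url = "http://eg-host:8888"              # placeholder EG endpoint
c.GatewayClient.request_timeout = float(launch_timeout)  # >= KERNEL_LAUNCH_TIMEOUT
```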
I'm spawning kernels into a kubernetes cluster. With AWS Fargate, we create one node per kernel pod. The main purpose of using AWS Fargate is the cost: you only pay for the CPU you use, whereas with typical nodes you pay regardless of whether the CPU is in use. It's quite strange because it gives me a kernel error even though my pod works correctly. I guess it's the fault of jupyter-notebook and not of enterprise-gateway.
No - it's not notebook. I'm sure the kernel startup is timing out because the time to create the node, create the pod AND download the image exceeds the timeout. At least that's what I surmise without having seen the error. Could you please provide the log output that includes the complete startup sequence (from the initial message to the error)? Again, why are you creating an entire node that hosts a single kernel pod? That seems extremely wasteful. Is that just how Fargate works? Are there ways to pre-create nodes - and provide a list of images that exist when the pod creation request occurs?
Yes, this is how it works. The benefit is that you just pay for the resources you use.
So are all pods running in their own individual node?
Yes.
Bummer. Can you please provide the complete set of log statements that correspond to the start-kernel request?
The problem is that I don't have any error in the enterprise-gateway logs. I have modified jupyterhub to allow … While enterprise-gateway is waiting to connect to the pod (the pod is pulling the image, so it's in Pending status), I get this error in the jupyterhub log:
The beginning of the start request:
That's the JEG log:
Thanks for the additional information along with the EG output - that's helpful. So you wind up with a running pod (node) for that kernel, and EG knows about it, but the notebook doesn't because it never completed the request. Is that correct? A few more questions... What version of notebook are you using?
Yes, it looks like that. It seems that notebook doesn't wait for the kernel to be set up.
I'm using it via --gateway-url.
Notebook version: 6.0.0
OK - thanks. So the request timeout should be getting set to 600+2, but the connect timeout still defaults to 60 seconds. Can you try to determine - perhaps by enabling debug logging - whether the timeout you're hitting is that 60-second connect timeout? This option can also be set via the command line.
Yes, it was around 60 seconds.
Cool - there's hope then. I don't think it matters that you're using Hub, other than a possible inconvenience in setting options (which is why I added the comment about using the command line option). However, I should have stated it this way: …
Yep, I have tested that var but it doesn't work. I don't know why jupyter notebook doesn't override that default value.
Setting these can be tricky. If possible, it might be worth trying to either temporarily modify the default value (to see if we're barking up the right tree with this), or add a log statement just prior to the fetch call that dumps the kwargs. Sometimes I'll try setting these to stupidly short values, like 1 or 2 seconds, expecting the request to fail in that amount of time. If the request continues for the default time, then I know my setting didn't work.
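
A minimal sketch of that sanity check, assuming the notebook 6.x `GatewayClient` configurables `connect_timeout` and `request_timeout` in a `jupyter_notebook_config.py`; if a gateway request still survives for roughly 60 seconds with these values, the setting isn't being picked up:

```python
# jupyter_notebook_config.py (debugging sketch): deliberately tiny timeouts.
# A gateway request should now fail within a couple of seconds; if it still
# takes ~60 seconds to fail, the configured values are not taking effect.
c = get_config()  # provided by the Jupyter config loader

c.GatewayClient.connect_timeout = 2.0   # seconds; normally much larger
c.GatewayClient.request_timeout = 2.0   # seconds
```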
Thanks for the advice! Tomorrow I will try to modify the notebook package for debugging.
Do you mean printing the **kwargs?
Yes, print or log the kwargs (a sketch follows below). I'm not sure how much the notebook docs will help, but they do list all the config options. Please note that in their docs, the data type of the config option is merged (appended) to the name of the option. You can get clearer text by running …
Good luck! 🤞
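
Here is a hedged sketch of the kind of temporary debug print being discussed, assuming the gateway client funnels its HTTP calls through a helper that forwards **kwargs to Tornado's AsyncHTTPClient.fetch (the helper name below is hypothetical, not the actual notebook code):

```python
# Hypothetical debug wrapper: log the timeout-related kwargs right before the
# HTTP request to the Gateway is issued, then perform the fetch as usual.
from tornado.httpclient import AsyncHTTPClient

async def debug_gateway_fetch(endpoint, **kwargs):
    timeouts = {k: v for k, v in kwargs.items() if "timeout" in k}
    print(f"gateway fetch {endpoint}: timeout kwargs={timeouts}")  # e.g. connect_timeout, request_timeout
    return await AsyncHTTPClient().fetch(endpoint, **kwargs)
```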
I have set the KERNEL_LAUNCH_TIMEOUT env value to 5 seconds and it crashes after those 5 seconds, but when I give it more than 60 seconds, it crashes at second 60. For example, using this command:
I get this error. It seems to have a maximum of 60 seconds. If I use 70 seconds, for example, it crashes at second 60.
Yes, we know KERNEL_LAUNCH_TIMEOUT is being set correctly - that's not the issue. The issue is with the connect timeout option. It defaults to 60 seconds. That's the one that should be set very low (while the others remain high) to see whether your new value is taking effect.

KERNEL_LAUNCH_TIMEOUT should be viewed as an EG-side value. We know it is working because you didn't get a timeout in the 7-8 MINUTES it took to start the kernel from EG - despite the front end timing out. The request and connect timeout values are client-side values, in that they affect the actual connection between the client (notebook) and EG.

However, even if we find a way to extend the connection timeout correctly, I don't think AWS Fargate is a viable environment for kernels on k8s, simply due to the startup times. You cannot ask anyone to wait 8 minutes when launching a notebook (starting a kernel). That just isn't acceptable. I will try to read up on Fargate, but if it truly creates blank nodes per pod, I would imagine all applications have this issue. Only those applications that can "preload" pods would have any chance of working.

If we can't figure out a solution to the 8-minute startup issue (even after solving the connection issue) and you must use Fargate, I would recommend you use JupyterHub with kernels local to the Notebook pod instance. Then your 8-minute startup cost moves to starting the Notebook server rather than each kernel - which is more tolerable (although also obnoxious).
Yes, I understand your concern. My idea is to create a hybrid platform, where kernel pods run on EC2 instances (worker nodes) and, for exceptional cases that require more power, on Fargate. With this solution we would reduce costs, since there would not always have to be a resource-heavy instance (worker node) running - each instance is paid per hour, while on Fargate you pay for the CPU you use.
OK - interesting - thanks for the explanation. So for those exceptional cases, an 8-minute start time is part of the "cost" of using expensive resources? Let's continue to figure out where the timeout is getting imposed and work around that.
Hi @kevin-bates! Could it be possible that JEG is blocked during pod creation and that's why we are getting a connection timeout from Notebook? I mean, while the pod is pulling the image, JEG cannot respond to notebook requests.
This kind of delay is no different from the delay imposed waiting for the kernel's startup in YARN or regular k8s - it's just that it takes a couple of orders of magnitude longer when image pulling is in play. Once the async changes are released down the stack (which should be soon), JEG will be able to respond to other start requests, but I don't think the current behavior is causing issues within the single start request (other than it taking too long). At least we must assume that's the case until we determine where the connect timeout is coming from. Have you had a chance to set the connect timeout to a very low value to confirm your setting takes effect?
This combination of values works:
We just need to define the environment variable KERNEL_LAUNCH_TIMEOUT, because there is no configuration option to include it. I have tested it on my local cluster after removing the kernel-py image.
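
As an illustration of that working combination - a sketch assuming the notebook 6.x `GatewayClient` configurables and a 10-minute budget for node creation plus image pull; the EG URL and the 600-second values are placeholders:

```python
# jupyter_notebook_config.py (sketch): give the gateway request the same
# budget as the kernel launch, and set KERNEL_LAUNCH_TIMEOUT separately
# since it is only picked up from the environment.
# (Exporting KERNEL_LAUNCH_TIMEOUT in the shell before starting the server
# works as well.)
import os

c = get_config()  # provided by the Jupyter config loader

os.environ["KERNEL_LAUNCH_TIMEOUT"] = "600"        # EG-side wait for the kernel (seconds)

c.GatewayClient.url = "http://eg-host:8888"        # placeholder EG endpoint
c.GatewayClient.request_timeout = 600.0            # client-side request timeout
c.GatewayClient.connect_timeout = 600.0            # client-side connect timeout
```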
That's great news. Sorry about the need to set KERNEL_LAUNCH_TIMEOUT separately. It's still unfortunate that startups on Fargate will always incur the cost of pulling the kernel image (and starting a new node)! As a result, I don't think we can claim support for Fargate.
I will close this issue. Regarding Fargate, the image pull times are not as high as we saw in the logs (9 minutes). Also, if someone is reading this issue, I have to say that if you are using the LoadBalancer service type in AWS, you have to change the connection timeout (the default is 60 seconds).
(#5317) Prior to this change, the request timeout for a Gateway request was synchronized with KERNEL_LAUNCH_TIMEOUT only if KLT was greater. However, the two are closely associated and KLT should be adjusted if the configurable request_timeout is greater. This change ensures that the two values are synchronized to the greater value. It changes the two configurable timeouts to default to 40 (to match that of KLT) and removes the 2-second pad, since that wasn't helpful and only confused the situation. These changes were prompted by this issue: jupyter-server/enterprise_gateway#792
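
A minimal sketch of the synchronization described in that change (not the actual notebook implementation; the function name is illustrative):

```python
# Sketch: synchronize the Gateway request timeout and KERNEL_LAUNCH_TIMEOUT
# to the greater of the two values, with both defaulting to 40 seconds.
import os

def sync_gateway_timeouts(request_timeout: float = 40.0) -> float:
    launch_timeout = float(os.environ.get("KERNEL_LAUNCH_TIMEOUT", "40"))
    synced = max(request_timeout, launch_timeout)            # take the greater value
    os.environ["KERNEL_LAUNCH_TIMEOUT"] = str(int(synced))   # propagate to the kernel start request
    return synced                                            # use as the request timeout (no 2-second pad)
```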
Hi @kevin-bates. I am trying to spawn kernel pods into AWS Fargate.
AWS Fargate works like a normal kubernetes worker node, so I think it should be possible.
The problem is that each time we want to spawn a notebook, we create a node. Because of this, I am getting an error starting the kernel.
This is because I have to pull the elyra/kernel-py image. I have tried modifying the variables EG_LAUNCH_TIMEOUT and KERNEL_LAUNCH_TIMEOUT, but I always get a timeout at second 40.
Are there any other variables that would make it keep waiting for the image pull (for example, 10 minutes) before raising the kernel error?
[E 2020-03-19 09:28:30.158 EnterpriseGatewayApp] KernelID: 'c33d05ea-0da9-40c4-86ae-fc068559ae95' launch timeout due to: Waited too long (40.0s) to get connection file
enterprise-gateway.yaml
kernel.json
kernel-pod.yaml.j2
I have modified this method to print self.kernel_launch_timeout and I get this:
It looks like the pod doesn't care about the env variables KERNEL_LAUNCH_TIMEOUT and EG_KERNEL_LAUNCH_TIMEOUT.
Version: 2.1.0